ClearSep
Unleashing the Power of Natural Audio Featuring Multiple Sound Sources

Abstract

Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, a capability critical for artificial auditory perception. However, current methods rely heavily on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, enabling effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics, Re-SDR and Re-SISDR, to quantitatively assess separation quality, and use them as thresholds to iteratively apply the data engine alongside model training, progressively improving separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios.
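To make the idea of a remix-based metric concrete, the sketch below sums the separated tracks back together and scores the result against the original mixture with SDR and scale-invariant SDR. This is only an illustration under assumptions: the exact definitions of Re-SDR and Re-SISDR are given in the paper, and the function names `sdr_db`, `si_sdr_db`, and `remix_scores` are hypothetical, not part of any released ClearSep code.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Plain signal-to-distortion ratio in dB between two waveforms."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    noise = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(noise**2) + eps))


def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB (zero-mean signals, optimally scaled reference)."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))


def remix_scores(mixture, tracks):
    """Assumed remix-based scoring: sum the separated tracks, compare to the mixture."""
    remix = np.sum(np.stack(tracks), axis=0)
    return sdr_db(mixture, remix), si_sdr_db(mixture, remix)


# Synthetic check: two random "sources" and their sum as the mixture.
rng = np.random.default_rng(0)
source_a = rng.standard_normal(16000)
source_b = rng.standard_normal(16000)
mixture = source_a + source_b

complete = remix_scores(mixture, [source_a, source_b])        # nothing lost
lossy = remix_scores(mixture, [source_a, 0.5 * source_b])     # half of one source lost
print(complete, lossy)
```

A separation that loses or distorts part of the mixture yields a remix that no longer matches the input, so both scores drop. This is why such scores can act as a quality signal without access to ground-truth sources.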

NEW Results!

Filtered Results

Filtered results with different metrics, comparing different filter methods.

More Data Engine Samples.

Separation results corresponding to different Re-SDR and Re-SISDR.

Demo Sections

AudioCaps Dataset

Explore caption- and class-queried sound separation on AudioCaps.

Data Engine

Discover how ClearSep processes and separates mixed-source audio.

Silence Analysis

Separated tracks when there is no target signal in the audio.

Real-World Samples

Listen to separation results on real-world audio.

Filtered results with different metrics. NEW!


Samples incorrectly passed by the CLAP score filter but strictly filtered out by the remix-based filter.

Mixture Class1 Class2 More Details
Source Spectrogram

"youtube id:1kjnqM-ptrk"

AudioSEP Spectrogram

"class: Speech"

AudioSEP Spectrogram

"class: Shatter"

"Re-SDR:7.912"

"Re-SISDR:6.499"

"Failure Type (CLAP score): Ineffectiveness in handling partial mismatches"

Source Spectrogram

"youtube id:zyGjrJfE_rg"

AudioSEP Spectrogram

"class: Speech"

AudioSEP Spectrogram

"class: Silence"

"Re-SDR:14.177"

"Re-SISDR:13.411"

"Failure Type (CLAP score): Lack of generalization on low-resource categories"


Low-resource category samples mistakenly filtered out by the CLAP score filter.

Mixture Class1 Class2 More Details
Source Spectrogram

"youtube id:Yq-532qrgyUA"

AudioSEP Spectrogram

"class: Speech"

AudioSEP Spectrogram

"class: Groan"

"Re-SDR:36.447"

"Re-SISDR:36.307"

"Failure Type (CLAP score): Lack of generalization on low-resource categories"

Source Spectrogram

"youtube id:zyGjrJfE_rg"

AudioSEP Spectrogram

"class: Speech"

AudioSEP Spectrogram

"class: Buzzer"

"Re-SDR:41.094"

"Re-SISDR:40.560"

"Failure Type (CLAP score): Lack of generalization on low-resource categories"

Samples of different Re-SDR and Re-SISDR. NEW!

Remix-based Metric Mixture Class1 Class2

"Re-SDR: 27.547, Re-SISDR: 27.017"

"ytb id:Zmv5RK2kwuI"

Source Spectrogram
AudioSEP Spectrogram

"Music"

CLAPSEP(P) Spectrogram

"Ping"

"Re-SDR: 23.079, Re-SISDR: 22.859"

"ytb id:ZdxLr4GaGpU"

Source Spectrogram
AudioSEP Spectrogram

"Speech"

CLAPSEP(P) Spectrogram

"Thunk"

"Re-SDR: 18.160, Re-SISDR: 16.897"

"ytb id:nyR4ONkiUaU"

Source Spectrogram
AudioSEP Spectrogram

"Rain"

CLAPSEP(P) Spectrogram

"Thunderstorm"

"Re-SDR: 8.690, Re-SISDR: 6.042"

"ytb id:csOhQS2y8D0"

Source Spectrogram
AudioSEP Spectrogram

"Speech"

CLAPSEP(P) Spectrogram

"Scissors"

"Re-SDR: 8.933, Re-SISDR: 5.478"

"ytb id:xtIAZRLTHR0"

Source Spectrogram
AudioSEP Spectrogram

"Speech"

CLAPSEP(P) Spectrogram

"Drip"

AudioCaps Dataset Demos

Caption-Queried Sound Separation Requires Query Refinement

Text Query Mixture Interference Target AudioSep CLAPSEP ClearSep

"Some rustling followed by a quick powerful hiss."

Mixture Spectrogram
Interference Spectrogram
Target Spectrogram
AudioSep Prediction
CLAPSEP Prediction
ClearSep Prediction

Label-Queried Sound Separation Automated Query Generation

Text Query Mixture Interference Target CLAPSEP(P) ClearSep(P) ClearSep(P+N)

"Door"

Mixture Spectrogram
Interference Spectrogram
Target Spectrogram
CLAPSEP(P) Prediction
ClearSep(P) Prediction
ClearSep(P+N) Prediction

Data Engine Separated Tracks

Mixture Class1 Class2 Re-mixed Audio
Mixture Spectrogram
Class1 Spectrogram

"Speech"

Class2 Spectrogram

"Fire Alarm"

Re-mixed Spectrogram

Re-SDR: 35.847 dB, Re-SISDR: 35.704 dB
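The abstract describes using the remix-based metrics as thresholds inside the data engine. A minimal sketch of such a filter, applied to the four samples listed in the filter comparison earlier on this page, might look as follows. The `SeparatedSample` record, the `passes_remix_filter` helper, and the 20 dB thresholds are all hypothetical and chosen only for illustration; the thresholds actually used by ClearSep are not stated on this page.

```python
from dataclasses import dataclass


@dataclass
class SeparatedSample:
    """Hypothetical record for one mixture split into two class tracks."""
    youtube_id: str
    classes: tuple
    re_sdr: float
    re_sisdr: float


def passes_remix_filter(sample, min_re_sdr=20.0, min_re_sisdr=20.0):
    """Keep a sample only if both remix-based scores reach the (assumed) thresholds."""
    return sample.re_sdr >= min_re_sdr and sample.re_sisdr >= min_re_sisdr


# Scores copied from the filter-comparison samples shown on this page.
samples = [
    SeparatedSample("1kjnqM-ptrk", ("Speech", "Shatter"), 7.912, 6.499),
    SeparatedSample("zyGjrJfE_rg", ("Speech", "Silence"), 14.177, 13.411),
    SeparatedSample("Yq-532qrgyUA", ("Speech", "Groan"), 36.447, 36.307),
    SeparatedSample("zyGjrJfE_rg", ("Speech", "Buzzer"), 41.094, 40.560),
]

kept = [s for s in samples if passes_remix_filter(s)]
for s in kept:
    print(s.youtube_id, s.classes, s.re_sdr, s.re_sisdr)
```

With these illustrative thresholds, the two low-scoring samples are rejected and the two low-resource category samples (Groan, Buzzer) are retained, which matches the behaviour the page attributes to the remix-based filter.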

Silence Separation Analysis

Query for Nonexistent Signal Source AudioSEP CLAPSEP(P) CLAPSEP(P+N) ClearSep(P) ClearSep(P+N)

"Timpani"

(Not present in the audio)

Source Spectrogram
AudioSEP Spectrogram
CLAPSEP(P) Spectrogram
CLAPSEP(P+N) Spectrogram
ClearSep(P) Spectrogram
ClearSep(P+N) Spectrogram
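When the queried source is absent, an ideal separator should return a near-silent track. One simple way to quantify this alongside the spectrograms is the RMS level of the output in dB relative to full scale. The sketch below is an assumption made for illustration, not the evaluation used by the authors; `rms_dbfs`, `is_effectively_silent`, and the -60 dBFS threshold are hypothetical.

```python
import numpy as np


def rms_dbfs(track, eps=1e-12):
    """RMS level of a waveform in dB relative to full scale (amplitude 1.0)."""
    track = np.asarray(track, dtype=np.float64)
    rms = np.sqrt(np.mean(track**2))
    return 20.0 * np.log10(rms + eps)


def is_effectively_silent(track, threshold_dbfs=-60.0):
    """Treat a track as silent if its RMS level is at or below the (assumed) threshold."""
    return rms_dbfs(track) <= threshold_dbfs


# Synthetic stand-ins for separator outputs on a query that is not in the audio.
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
leaky_output = 0.5 * np.sin(2 * np.pi * 440.0 * t)    # interference leaks through
quiet_output = 1e-4 * np.sin(2 * np.pi * 440.0 * t)   # almost fully suppressed

print(is_effectively_silent(leaky_output), is_effectively_silent(quiet_output))
```

A lower level on an absent-target query indicates better suppression; the threshold would need tuning to the loudness normalization of the audio being evaluated.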

Real-world Audio Samples

Query Source AudioSEP CLAPSEP(P) CLAPSEP(P+N) ClearSep(P) ClearSep(P+N)

"Speech"

"ytb id:-mNbbwgUZtQ"

Source Spectrogram
AudioSEP Spectrogram
CLAPSEP(P) Spectrogram
CLAPSEP(P+N) Spectrogram
ClearSep(P) Spectrogram
ClearSep(P+N) Spectrogram