Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio, a capability that is critical for artificial auditory perception. However, current methods rely heavily on artificially mixed audio for training, which limits their ability to generalize to naturally mixed audio collected in real-world environments. To overcome this limitation, we propose **ClearSep**, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks, thereby enabling effective sound separation in real-world scenarios. We introduce two remix-based evaluation metrics to quantitatively assess separation quality, and we use these metrics as thresholds to apply the data engine iteratively alongside model training, progressively improving separation performance. In addition, we propose a series of training strategies tailored to these separated independent tracks to make the best use of them. Extensive experiments demonstrate that ClearSep achieves state-of-the-art performance across multiple sound separation tasks, highlighting its potential for advancing sound separation in natural audio scenarios.
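
To make the idea of a remix-based metric concrete, the sketch below sums the separated tracks back together and scores that remix against the original mixture with SDR and SI-SDR. This is only an illustration under assumptions: the page does not give the exact definitions of Re-SDR and Re-SISDR, so the function names (`sdr_db`, `si_sdr_db`, `remix_scores`) and the choice to compare the mixture with the plain sum of tracks are hypothetical, and the signals are synthetic sine tones rather than real data.

```python
import numpy as np


def sdr_db(reference, estimate, eps=1e-12):
    """Plain SDR in dB: reference energy divided by residual energy."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    noise = reference - estimate
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(noise**2) + eps))


def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB: project the estimate onto the reference first."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    scale = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = scale * reference
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(noise**2) + eps))


def remix_scores(mixture, tracks):
    """Assumed remix-based scores: compare the mixture with the sum of the tracks."""
    remix = np.sum(np.stack(tracks, axis=0), axis=0)
    return sdr_db(mixture, remix), si_sdr_db(mixture, remix)


# Synthetic example: two tones mixed at 16 kHz for one second.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
tone_a = np.sin(2 * np.pi * 440 * t)
tone_b = 0.5 * np.sin(2 * np.pi * 1000 * t)
mixture = tone_a + tone_b

# A near-perfect separation versus one that loses most of the second source.
good_tracks = [tone_a, tone_b + 0.001 * rng.standard_normal(t.size)]
bad_tracks = [tone_a, 0.2 * tone_b]

good_scores = remix_scores(mixture, good_tracks)
bad_scores = remix_scores(mixture, bad_tracks)
print("good separation (SDR, SI-SDR):", good_scores)
print("lossy separation (SDR, SI-SDR):", bad_scores)
```

In this toy setting, a separation that drops energy from one source yields a remix that no longer matches the mixture, so both scores fall. A score of this kind needs no ground-truth sources, which is what would make it usable on naturally mixed audio.
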
Filtered results with different metrics: a comparison of different filtering methods.
Separation results corresponding to different Re-SDR and Re-SISDR values.

Mixture | Class1 | Class2 | More Details |
---|---|---|---|
![]() "youtube id:1kjnqM-ptrk" | ![]() "class: Speech" | ![]() "class: Shatter" | Re-SDR: 7.912, Re-SISDR: 6.499. Failure Type (CLAP score): Ineffectiveness in handling partial mismatches |
![]() "youtube id:zyGjrJfE_rg" | ![]() "class: Speech" | ![]() "class: Silence" | Re-SDR: 14.177, Re-SISDR: 13.411. Failure Type (CLAP score): Lack of generalization on low-resource categories |

Mixture | Class1 | Class2 | More Details |
---|---|---|---|
![]() "youtube id:Yq-532qrgyUA" | ![]() "class: Speech" | ![]() "class: Groan" | Re-SDR: 36.447, Re-SISDR: 36.307. Failure Type (CLAP score): Lack of generalization on low-resource categories |
![]() "youtube id:zyGjrJfE_rg" | ![]() "class: Speech" | ![]() "class: Buzzer" | Re-SDR: 41.094, Re-SISDR: 40.560. Failure Type (CLAP score): Lack of generalization on low-resource categories |

Remix-based Metric | Mixture | Class1 | Class2 |
---|---|---|---|
Re-SDR: 27.547, Re-SISDR: 27.017 (youtube id: Zmv5RK2kwuI) | ![]() | ![]() "Music" | ![]() "Ping" |
Re-SDR: 23.079, Re-SISDR: 22.859 (youtube id: ZdxLr4GaGpU) | ![]() | ![]() "Speech" | ![]() "Thunk" |
Re-SDR: 18.160, Re-SISDR: 16.897 (youtube id: nyR4ONkiUaU) | ![]() | ![]() "Rain" | ![]() "Thunderstorm" |
Re-SDR: 8.690, Re-SISDR: 6.042 (youtube id: csOhQS2y8D0) | ![]() | ![]() "Speech" | ![]() "Scissors" |
Re-SDR: 8.933, Re-SISDR: 5.478 (youtube id: xtIAZRLTHR0) | ![]() | ![]() "Speech" | ![]() "Drip" |
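
The scores in the table above can be used to illustrate threshold-based filtering of separated clips. The following is a minimal sketch, not the authors' data engine: the `SeparatedClip` record, the `filter_by_remix_scores` helper, the rule that both scores must pass, and the 15 dB cut-offs are all hypothetical choices made for illustration. Only the clip IDs and score values are taken from the table.

```python
from dataclasses import dataclass


@dataclass
class SeparatedClip:
    """Hypothetical record for one separated clip and its remix-based scores."""
    clip_id: str
    re_sdr: float
    re_sisdr: float


def filter_by_remix_scores(clips, min_re_sdr, min_re_sisdr):
    """Keep clips whose scores both reach the given thresholds (assumed rule)."""
    return [
        clip
        for clip in clips
        if clip.re_sdr >= min_re_sdr and clip.re_sisdr >= min_re_sisdr
    ]


# Score values copied from the table above.
clips = [
    SeparatedClip("Zmv5RK2kwuI", 27.547, 27.017),
    SeparatedClip("ZdxLr4GaGpU", 23.079, 22.859),
    SeparatedClip("nyR4ONkiUaU", 18.160, 16.897),
    SeparatedClip("csOhQS2y8D0", 8.690, 6.042),
    SeparatedClip("xtIAZRLTHR0", 8.933, 5.478),
]

# The 15 dB thresholds are illustrative only; the page does not state the real values.
kept = filter_by_remix_scores(clips, min_re_sdr=15.0, min_re_sisdr=15.0)
kept_ids = [clip.clip_id for clip in kept]
print(kept_ids)
```

With these illustrative thresholds, the three higher-scoring clips are kept and the two lower-scoring ones are dropped. In an iterative setup, the kept tracks would feed the next round of training, and the thresholds could be adjusted between rounds.
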

Text Query | Mixture | Interference | Target | AudioSep | CLAPSEP | ClearSep |
---|---|---|---|---|---|---|
"Some rustling followed by a quick powerful hiss." | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |

Text Query | Mixture | Interference | Target | CLAPSEP(P) | ClearSep(P) | ClearSep(P+N) |
---|---|---|---|---|---|---|
"Door" | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |

Mixture | Class1 | Class2 | Re-mixed Audio |
---|---|---|---|
![]() | ![]() "Speech" | ![]() "Fire Alarm" | ![]() Re-SDR: 35.847 dB, Re-SISDR: 35.704 dB |

Query for Non-Existent Signal | Source | AudioSep | CLAPSEP(P) | CLAPSEP(P+N) | ClearSep(P) | ClearSep(P+N) |
---|---|---|---|---|---|---|
"Timpani" (not present in the audio) | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |

Query | Source | AudioSep | CLAPSEP(P) | CLAPSEP(P+N) | ClearSep(P) | ClearSep(P+N) |
---|---|---|---|---|---|---|
"Speech" (youtube id: -mNbbwgUZtQ) | ![]() | ![]() | ![]() | ![]() | ![]() | ![]() |