Weakly Supervised U-Net with Limited Upsampling for Sound Event Detection

Sangwon Lee, Hyemi Kim, Gil Jin Jang

Research output: Contribution to journal › Article › peer-review

3 Scopus citations

Abstract

Featured Application: Audio classification; music information retrieval; audio scene characterization; temporal localization of sound sources; audio indexing; audio surveillance systems; anomaly detection from audio sounds.

Sound event detection (SED) is the task of identifying sound events in audio recordings and determining their onset and offset times. When the training data contain only the event identities and no timing information, SED must be solved by weakly supervised learning. The conventional U-Net with global weighted rank pooling (GWRP) performs well but is computationally expensive. We propose a novel U-Net with limited upsampling (LUU-Net) and global threshold average pooling (GTAP) to reduce both the model size and the computational overhead. Expansion along the frequency axis in the U-Net decoder is minimized, which reduces the output map sizes by 40% at the convolutional layers and by 12.5% at the fully connected layers without degrading SED performance.

Experimental results on a mixed dataset built from DCASE 2018 Tasks 1 and 2 show that LUU-Net with GTAP is about 23% faster in training and achieves F1 scores of 0.644 in audio tagging and 0.531 in weakly supervised SED, whereas U-Net with GWRP achieves 0.629 and 0.492, respectively. The major contribution of the proposed LUU-Net is the reduction in computation time while SED performance is maintained or improved. The other proposed method, GTAP, further reduces the training time and, through a single adjustable hyperparameter, provides versatility across various audio mixing conditions.
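The abstract names two pooling operators without defining them. As a rough illustration only, the sketch below shows how each could collapse a flattened per-class activation map into a single clip-level score. The GWRP function follows the commonly used definition (values sorted in descending order and weighted by a decay factor raised to the rank). The GTAP function is an assumption: it averages only the values at or above a threshold and falls back to the maximum when none qualify. The function names, the fallback rule, and the example numbers are hypothetical and are not taken from the paper.

```python
def global_weighted_rank_pool(values, decay):
    """GWRP: weighted mean of the values sorted in descending order.

    The value at rank j (0-based) receives weight decay**j, so decay=1.0
    reduces to average pooling and decay=0.0 reduces to max pooling.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    ranked = sorted(values, reverse=True)
    weights = [decay ** rank for rank in range(len(ranked))]
    return sum(w * v for w, v in zip(weights, ranked)) / sum(weights)


def threshold_average_pool(values, threshold):
    """Hypothetical GTAP-style pooling (an assumption, not the paper's code).

    Averages only the values at or above `threshold`. If no value
    qualifies, falls back to the maximum; this fallback is also assumed.
    """
    if not values:
        raise ValueError("values must be non-empty")
    selected = [v for v in values if v >= threshold]
    if not selected:
        return max(values)
    return sum(selected) / len(selected)


# Made-up activations for one event class, flattened over time and frequency.
activations = [0.2, 0.8, 0.5]
gwrp_score = global_weighted_rank_pool(activations, decay=0.5)
gtap_score = threshold_average_pool(activations, threshold=0.5)
```

Under these assumptions, the threshold version needs no sort over the activation map, which is one plausible reason a threshold-based pooling could be cheaper than a rank-based one; the abstract itself states only that GTAP further reduces training time.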

Original language: English
Article number: 6822
Journal: Applied Sciences (Switzerland)
Volume: 13
Issue number: 11
State: Published - Jun 2023

Keywords

  • U-Net
  • pooling
  • sound event detection
  • weakly supervised learning
