TY - GEN
T1 - Waveform-based End-to-end Deep Convolutional Neural Network with Multi-scale Sliding Windows for Weakly Labeled Sound Event Detection
AU - Lee, Seokjin
AU - Kim, Minhan
N1 - Publisher Copyright:
© 2020 IEEE.
PY - 2020/2
Y1 - 2020/2
N2 - In this paper, we propose a waveform-based end-to-end sound event detection algorithm that detects and classifies sound events using a deep convolutional neural network architecture. While most machine-learning-based acoustic signal processing systems utilize hand-crafted feature vectors, e.g., the log-Mel spectrogram, end-to-end methods, which operate on raw input data, have recently been investigated for various applications. We therefore develop an end-to-end architecture for sound event detection based on convolutional neural networks. The proposed model consists of multi-scale time frames and networks that handle both short and long signal characteristics; each frame slides by 0.1 s to provide sufficiently fine temporal resolution. The element network for each time frame consists of several deeply stacked one-dimensional convolutional neural networks. The outputs of the element networks are averaged and gated by sound activity detection. To handle unlabeled data, the networks are further trained using the mean-teacher model. A decision is made via double thresholding, and the results are refined using class-wise minimum gap/length compensation. To evaluate the proposed approach, simulations are performed with development data from DCASE 2019 Task 4; the proposed algorithm achieves macro-averaged F1 scores of 31.7% on the DCASE 2019 development dataset, 30.2% on the DCASE 2018 evaluation dataset, and 26.7% on the DCASE 2019 evaluation dataset.
AB - In this paper, we propose a waveform-based end-to-end sound event detection algorithm that detects and classifies sound events using a deep convolutional neural network architecture. While most machine-learning-based acoustic signal processing systems utilize hand-crafted feature vectors, e.g., the log-Mel spectrogram, end-to-end methods, which operate on raw input data, have recently been investigated for various applications. We therefore develop an end-to-end architecture for sound event detection based on convolutional neural networks. The proposed model consists of multi-scale time frames and networks that handle both short and long signal characteristics; each frame slides by 0.1 s to provide sufficiently fine temporal resolution. The element network for each time frame consists of several deeply stacked one-dimensional convolutional neural networks. The outputs of the element networks are averaged and gated by sound activity detection. To handle unlabeled data, the networks are further trained using the mean-teacher model. A decision is made via double thresholding, and the results are refined using class-wise minimum gap/length compensation. To evaluate the proposed approach, simulations are performed with development data from DCASE 2019 Task 4; the proposed algorithm achieves macro-averaged F1 scores of 31.7% on the DCASE 2019 development dataset, 30.2% on the DCASE 2018 evaluation dataset, and 26.7% on the DCASE 2019 evaluation dataset.
KW - convolutional neural network
KW - end-to-end
KW - sound event detection
KW - waveform
KW - weakly supervised
UR - http://www.scopus.com/inward/record.url?scp=85084046466&partnerID=8YFLogxK
U2 - 10.1109/ICAIIC48513.2020.9064985
DO - 10.1109/ICAIIC48513.2020.9064985
M3 - Conference contribution
AN - SCOPUS:85084046466
T3 - 2020 International Conference on Artificial Intelligence in Information and Communication, ICAIIC 2020
SP - 182
EP - 186
BT - 2020 International Conference on Artificial Intelligence in Information and Communication, ICAIIC 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2nd International Conference on Artificial Intelligence in Information and Communication, ICAIIC 2020
Y2 - 19 February 2020 through 21 February 2020
ER -