TY - JOUR
T1 - Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation
AU - Kim, Han Gyu
AU - Jang, Gil Jin
AU - Oh, Yung Hwan
AU - Choi, Ho Jin
N1 - Publisher Copyright:
© 2019, Springer Science+Business Media, LLC, part of Springer Nature.
PY - 2020/10/1
Y1 - 2020/10/1
N2 - In this paper, we propose speech/music pitch classification based on a recurrent neural network (RNN) for monaural speech segregation from music interferences. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. However, for speech signals mixed with music, speech pitch estimation becomes unreliable, as speech and music have similar harmonic structures. In order to remove the music interference effectively, we propose an RNN-based speech/music pitch classification. Our proposed method models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or music. Among various types of RNNs, we chose the simple recurrent network, long short-term memory (LSTM), and bidirectional LSTM for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods for speech–music mixtures without loss of segregation performance for speech–noise mixtures.
AB - In this paper, we propose speech/music pitch classification based on a recurrent neural network (RNN) for monaural speech segregation from music interferences. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. However, for speech signals mixed with music, speech pitch estimation becomes unreliable, as speech and music have similar harmonic structures. In order to remove the music interference effectively, we propose an RNN-based speech/music pitch classification. Our proposed method models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or music. Among various types of RNNs, we chose the simple recurrent network, long short-term memory (LSTM), and bidirectional LSTM for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods for speech–music mixtures without loss of segregation performance for speech–noise mixtures.
KW - Bidirectional long short-term memory
KW - Long short-term memory
KW - Pitch classification
KW - Recurrent neural network
KW - Speech pitch estimation
KW - Speech segregation
UR - http://www.scopus.com/inward/record.url?scp=85077145665&partnerID=8YFLogxK
U2 - 10.1007/s11227-019-02785-x
DO - 10.1007/s11227-019-02785-x
M3 - Article
AN - SCOPUS:85077145665
SN - 0920-8542
VL - 76
SP - 8193
EP - 8213
JO - J Supercomput
JF - Journal of Supercomputing
IS - 10
ER -