TY - GEN
T1 - Speech Emotion Recognition using Context-Aware Dilated Convolution Network
AU - Kakuba, Samuel
AU - Han, Dong Seog
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - Deep learning-based speech emotion recognition has been applied to social living assistance, health monitoring, authentication, and other human-to-machine interaction applications. Because of the ubiquitous nature of these applications, computationally efficient and robust speech emotion recognition models are required. The nature of the speech signal requires tracking time steps and analyzing long-term dependencies and the contexts of the utterances, as well as the spatial cues. Recurrent neural networks such as long short-term memory and gated recurrent units, coupled with attention mechanisms, are often used to capture long-term dependencies and context in the speech signal. However, they do not capture the spatial cues that may exist in the speech signal. Moreover, most of these systems operate sequentially, which causes slow convergence and sluggish training. Therefore, we propose a model that employs dilated convolution layers in combination with hybrid attention mechanisms. The model uses multi-head attention to extract the global context in the feature representations, which are then fed into a bidirectional long short-term memory network configured with self-attention to further handle context and long-term dependencies. The model takes spectral and voice quality features extracted from the raw speech signals as input. The proposed model achieves performance comparable to existing models in terms of F1 score and accuracy, and its performance is also presented in terms of confusion matrices.
AB - Deep learning-based speech emotion recognition has been applied to social living assistance, health monitoring, authentication, and other human-to-machine interaction applications. Because of the ubiquitous nature of these applications, computationally efficient and robust speech emotion recognition models are required. The nature of the speech signal requires tracking time steps and analyzing long-term dependencies and the contexts of the utterances, as well as the spatial cues. Recurrent neural networks such as long short-term memory and gated recurrent units, coupled with attention mechanisms, are often used to capture long-term dependencies and context in the speech signal. However, they do not capture the spatial cues that may exist in the speech signal. Moreover, most of these systems operate sequentially, which causes slow convergence and sluggish training. Therefore, we propose a model that employs dilated convolution layers in combination with hybrid attention mechanisms. The model uses multi-head attention to extract the global context in the feature representations, which are then fed into a bidirectional long short-term memory network configured with self-attention to further handle context and long-term dependencies. The model takes spectral and voice quality features extracted from the raw speech signals as input. The proposed model achieves performance comparable to existing models in terms of F1 score and accuracy, and its performance is also presented in terms of confusion matrices.
KW - context-aware emotion recognition
KW - dilated convolution
KW - multi-head attention
UR - http://www.scopus.com/inward/record.url?scp=85143051833&partnerID=8YFLogxK
U2 - 10.1109/APCC55198.2022.9943771
DO - 10.1109/APCC55198.2022.9943771
M3 - Conference contribution
AN - SCOPUS:85143051833
T3 - APCC 2022 - 27th Asia-Pacific Conference on Communications: Creating Innovative Communication Technologies for Post-Pandemic Era
SP - 601
EP - 604
BT - APCC 2022 - 27th Asia-Pacific Conference on Communications
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 27th Asia-Pacific Conference on Communications, APCC 2022
Y2 - 19 October 2022 through 21 October 2022
ER -
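
The following is a minimal PyTorch sketch, appended after the record, of the architecture the abstract describes: stacked dilated 1-D convolutions over spectral and voice quality feature frames, multi-head attention for global context, and a bidirectional LSTM pooled by additive self-attention. It is not the authors' implementation; the feature dimension (40), layer widths, head count, and the four emotion classes are illustrative assumptions.

# Illustrative sketch only -- not the authors' code. Layer sizes, the
# 40-dimensional input features, and the 4 emotion classes are assumptions.
import torch
import torch.nn as nn


class ContextAwareDilatedSER(nn.Module):
    def __init__(self, n_features=40, n_classes=4, channels=64, lstm_hidden=128):
        super().__init__()
        # Stacked 1-D convolutions with increasing dilation widen the
        # temporal receptive field (spatial cues) without recurrence.
        self.dilated_convs = nn.Sequential(
            nn.Conv1d(n_features, channels, kernel_size=3, dilation=1, padding=1),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=2, padding=2),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, dilation=4, padding=4),
            nn.ReLU(),
        )
        # Multi-head attention over the convolutional feature maps
        # extracts global context across all time steps.
        self.mha = nn.MultiheadAttention(embed_dim=channels, num_heads=4,
                                         batch_first=True)
        # Bidirectional LSTM models long-term temporal dependencies.
        self.bilstm = nn.LSTM(channels, lstm_hidden, batch_first=True,
                              bidirectional=True)
        # Additive self-attention pools the BiLSTM outputs into one vector.
        self.attn_score = nn.Linear(2 * lstm_hidden, 1)
        self.classifier = nn.Linear(2 * lstm_hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, n_features) spectral + voice quality frames
        h = self.dilated_convs(x.transpose(1, 2)).transpose(1, 2)  # (B, T, C)
        ctx, _ = self.mha(h, h, h)                  # global context via MHA
        seq, _ = self.bilstm(ctx)                   # (B, T, 2*hidden)
        w = torch.softmax(self.attn_score(seq), dim=1)  # (B, T, 1) weights
        pooled = (w * seq).sum(dim=1)               # attention-weighted pooling
        return self.classifier(pooled)              # emotion logits


if __name__ == "__main__":
    model = ContextAwareDilatedSER()
    logits = model(torch.randn(2, 300, 40))  # 2 utterances, 300 frames each
    print(logits.shape)                      # torch.Size([2, 4])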