TY - GEN
T1 - Multi-Layer Depth Weighted Fusion Approach for Speech Emotion Recognition
AU - Kakuba, Samuel
AU - Han, Dong Seog
N1 - Publisher Copyright:
© 2025 IEEE.
PY - 2025
Y1 - 2025
N2 - Fusion techniques have been proposed as a solution to data scarcity in speech emotion recognition (SER). The conventional fusion techniques are broadly classified into early, intermediate, and late fusion. Though they exhibit commendable results in some cases, they are suboptimal. They limit the model's ability to learn distinctive and salient features, which are particularly crucial for enhancing performance in data-scarce scenarios. This is largely due to data sparsity and the loss of emotional information that results from fusion. In this paper, we introduce a multi-layer depth weighted fusion approach for SER. This approach fuses the feature representations from two branches of features across the shallow, intermediate, and high-level stages of a deep network. It enhances SER performance by utilizing attentive convolutional neural network (CNN) encoders, transformer encoders, and bidirectional long short-term memory (LSTM) networks to capture the contextualized spatial and temporal feature relationships. The model achieves accuracy scores of 87.17%, 93.73%, and 96.58% on the KESDy18, RAVDESS, and EMODB datasets, respectively, and F1 scores of 87.84%, 93.73%, and 96.39% on the same datasets. Additionally, the model was evaluated on the SAVEE and CREMA datasets. These performance results highlight the effectiveness and robustness of our fusion approach across multiple emotional speech corpora.
AB - Fusion techniques have been proposed as a solution to data scarcity in speech emotion recognition (SER). The conventional fusion techniques are broadly classified into early, intermediate, and late fusion. Though they exhibit commendable results in some cases, they are suboptimal. They limit the model's ability to learn distinctive and salient features, which are particularly crucial for enhancing performance in data-scarce scenarios. This is largely due to data sparsity and the loss of emotional information that results from fusion. In this paper, we introduce a multi-layer depth weighted fusion approach for SER. This approach fuses the feature representations from two branches of features across the shallow, intermediate, and high-level stages of a deep network. It enhances SER performance by utilizing attentive convolutional neural network (CNN) encoders, transformer encoders, and bidirectional long short-term memory (LSTM) networks to capture the contextualized spatial and temporal feature relationships. The model achieves accuracy scores of 87.17%, 93.73%, and 96.58% on the KESDy18, RAVDESS, and EMODB datasets, respectively, and F1 scores of 87.84%, 93.73%, and 96.39% on the same datasets. Additionally, the model was evaluated on the SAVEE and CREMA datasets. These performance results highlight the effectiveness and robustness of our fusion approach across multiple emotional speech corpora.
KW - emotion recognition
KW - multi-layer
KW - weight fusion
UR - https://www.scopus.com/pages/publications/105018740525
U2 - 10.1109/ICUFN65838.2025.11169883
DO - 10.1109/ICUFN65838.2025.11169883
M3 - Conference contribution
AN - SCOPUS:105018740525
T3 - International Conference on Ubiquitous and Future Networks, ICUFN
SP - 55
EP - 58
BT - ICUFN 2025 - 16th International Conference on Ubiquitous and Future Networks
PB - IEEE Computer Society
T2 - 16th International Conference on Ubiquitous and Future Networks, ICUFN 2025
Y2 - 8 July 2025 through 11 July 2025
ER -