TY - JOUR
T1 - Addressing data scarcity in speech emotion recognition
T2 - A comprehensive review
AU - Kakuba, Samuel
AU - Han, Dong Seog
N1 - Publisher Copyright: © 2024 The Authors
PY - 2025/2
Y1 - 2025/2
N2 - Speech emotion recognition (SER) is a critical field within affective computing, aiming to detect and classify emotional states from speech signals, which vary dynamically over time. These signals encode complex relationships between features at multiple time scales, effectively reflecting a speaker's emotional state. Despite significant progress, SER faces the persistent challenge of labeled data scarcity, a major obstacle given the data-intensive requirements of deep learning models. This scarcity often results in small, imbalanced datasets that hinder model generalization. Various strategies, including feature selection, data augmentation, domain adaptation, and fusion techniques, have been employed to mitigate these issues. However, comprehensive reviews that critically analyze these methods remain limited. In this paper, we provide an extensive review of these data scarcity strategies in SER, assessing their merits and limitations in terms of efficiency and robustness. Special attention is given to how these strategies enhance the performance of both acoustic and multimodal SER systems when operating on limited datasets. Additionally, we highlight the potential of fusion strategies combined with attention mechanisms as promising solutions to improve convergence and reduce model complexity.
KW - Attention mechanisms
KW - Data scarcity
KW - Emotion recognition
KW - Limited datasets
UR - http://www.scopus.com/inward/record.url?scp=85210015137&partnerID=8YFLogxK
U2 - 10.1016/j.icte.2024.11.003
DO - 10.1016/j.icte.2024.11.003
M3 - Review article
AN - SCOPUS:85210015137
SN - 2405-9595
VL - 11
SP - 110
EP - 123
JO - ICT Express
JF - ICT Express
IS - 1
ER -