Addressing data scarcity in speech emotion recognition: A comprehensive review

Samuel Kakuba, Dong Seog Han

Research output: Contribution to journal › Review article › peer-review

Abstract

Speech emotion recognition (SER) is a critical field within affective computing, aiming to detect and classify emotional states from speech signals, which vary dynamically over time. These signals encode complex relationships between features at multiple time scales, effectively reflecting a speaker's emotional state. Despite significant progress, SER faces the persistent challenge of labeled data scarcity, a major obstacle given the data-intensive requirements of deep learning models. This scarcity often results in small, imbalanced datasets that hinder model generalization. Various strategies, including feature selection, data augmentation, domain adaptation, and fusion techniques, have been employed to mitigate these issues. However, comprehensive reviews that critically analyze these methods remain limited. In this paper, we provide an extensive review of these strategies for mitigating data scarcity in SER, assessing their merits and limitations in terms of efficiency and robustness. Special attention is given to how these strategies enhance the performance of both acoustic and multimodal SER systems when operating on limited datasets. Additionally, we highlight fusion strategies combined with attention mechanisms as promising solutions to improve convergence and reduce model complexity.

Original language: English
Pages (from-to): 110-123
Number of pages: 14
Journal: ICT Express
Volume: 11
Issue number: 1
State: Published - Feb 2025

Keywords

  • Attention mechanisms
  • Data scarcity
  • Emotion recognition
  • Limited datasets
