TY - JOUR
T1 - TSDCA-BA
T2 - An Ultra-Lightweight Speech Enhancement Model for Real-Time Hearing Aids with Multi-Scale STFT Fusion
AU - Fan, Zujie
AU - Guo, Zikun
AU - Lai, Yanxing
AU - Kim, Jaesoo
N1 - Publisher Copyright:
© 2025 by the authors.
PY - 2025/8
Y1 - 2025/8
N2 - Lightweight speech denoising models have made remarkable progress in improving both speech quality and computational efficiency. However, most models rely on long temporal windows as input, limiting their applicability in low-latency, real-time scenarios on edge devices. To address this challenge, we propose a lightweight hybrid module, Temporal Statistics Enhancement, Squeeze-and-Excitation-based Dual Convolutional Attention, and Band-wise Attention (TSE, SDCA, BA) Module. The TSE module enhances single-frame spectral features by concatenating statistical descriptors—mean, standard deviation, maximum, and minimum—thereby capturing richer local information without relying on temporal context. The SDCA and BA module integrates a simplified residual structure and channel attention, while the BA component further strengthens the representation of critical frequency bands through band-wise partitioning and differentiated weighting. The proposed model requires only 0.22 million multiply–accumulate operations (MMACs) and contains a total of 112.3 K parameters, making it well suited for low-latency, real-time speech enhancement applications. Experimental results demonstrate that among lightweight models with fewer than 200K parameters, the proposed approach outperforms most existing methods in both denoising performance and computational efficiency, significantly reducing processing overhead. Furthermore, real-device deployment on an improved hearing aid confirms an inference latency as low as 2 milliseconds, validating its practical potential for real-time edge applications.
AB - Lightweight speech denoising models have made remarkable progress in improving both speech quality and computational efficiency. However, most models rely on long temporal windows as input, limiting their applicability in low-latency, real-time scenarios on edge devices. To address this challenge, we propose a lightweight hybrid module, Temporal Statistics Enhancement, Squeeze-and-Excitation-based Dual Convolutional Attention, and Band-wise Attention (TSE, SDCA, BA) Module. The TSE module enhances single-frame spectral features by concatenating statistical descriptors—mean, standard deviation, maximum, and minimum—thereby capturing richer local information without relying on temporal context. The SDCA and BA module integrates a simplified residual structure and channel attention, while the BA component further strengthens the representation of critical frequency bands through band-wise partitioning and differentiated weighting. The proposed model requires only 0.22 million multiply–accumulate operations (MMACs) and contains a total of 112.3 K parameters, making it well suited for low-latency, real-time speech enhancement applications. Experimental results demonstrate that among lightweight models with fewer than 200K parameters, the proposed approach outperforms most existing methods in both denoising performance and computational efficiency, significantly reducing processing overhead. Furthermore, real-device deployment on an improved hearing aid confirms an inference latency as low as 2 milliseconds, validating its practical potential for real-time edge applications.
KW - audio denoising
KW - band-wise attention
KW - lightweight model
KW - low complexity
KW - multi-scale STFT
KW - real-time speech enhancement
UR - https://www.scopus.com/pages/publications/105013102291
U2 - 10.3390/app15158183
DO - 10.3390/app15158183
M3 - Article
AN - SCOPUS:105013102291
SN - 2076-3417
VL - 15
JO - Applied Sciences (Switzerland)
JF - Applied Sciences (Switzerland)
IS - 15
M1 - 8183
ER -