Abstract
Although recent deep-learning-based speech enhancement (SE) methods significantly outperform traditional approaches, their computational demands often scale in step with their performance, which typically makes them impractical to deploy on throughput-sensitive, resource-constrained edge devices. In this paper, we propose a novel lightweight spectral enhancement network (LSENet) designed to estimate high-quality speech with minimal computational overhead. The network consists of an encoder-decoder architecture built around a group-dilated convolutional module, which efficiently exploits time-frequency information while markedly reducing resource consumption through dilated convolutional groups and spectral-wise attention modules. Additionally, an improved dual-path recurrent neural network is introduced between the encoder and decoder to capture long-range contextual dependencies in the extracted features. Experimental results show that the proposed model achieves performance competitive with state-of-the-art baselines on the VoiceBank+DEMAND and DNS Challenge datasets while requiring only 39.4 thousand parameters and 237 million multiply-accumulate operations.
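The abstract describes the architecture only at a high level, so the following is a minimal PyTorch sketch of what a group-dilated convolutional block with spectral-wise attention might look like. The kernel size, dilation schedule, group count, and the squeeze-and-excitation-style gating over frequency bins are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Gating over the frequency axis (a hypothetical reading of the
    paper's 'spectral-wise attention'): pool over channels and time,
    then learn a per-frequency scale in [0, 1]."""
    def __init__(self, freq_bins: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(freq_bins, freq_bins // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(freq_bins // reduction, freq_bins),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        w = x.mean(dim=(1, 2))           # squeeze to (batch, freq)
        w = self.fc(w)                   # per-frequency gate
        return x * w[:, None, None, :]   # rescale each frequency bin

class GroupDilatedBlock(nn.Module):
    """Stack of grouped, dilated 2-D convolutions over (time, freq)
    with residual connections, followed by spectral attention.
    Dilations (1, 2, 4) and groups=4 are assumed values."""
    def __init__(self, channels: int, freq_bins: int,
                 dilations=(1, 2, 4), groups: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3,
                          padding=d, dilation=d, groups=groups),
                nn.BatchNorm2d(channels),
                nn.PReLU(),
            )
            for d in dilations
        )
        self.attn = SpectralAttention(freq_bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for conv in self.convs:
            x = x + conv(x)              # residual grouped dilated conv
        return self.attn(x)

# Example: 16 feature channels, 100 STFT frames, 161 frequency bins
block = GroupDilatedBlock(channels=16, freq_bins=161)
y = block(torch.randn(1, 16, 100, 161))
print(y.shape)  # torch.Size([1, 16, 100, 161])
```

Grouped convolutions reduce both the parameter count and the multiply-accumulate cost by roughly the group factor relative to a dense convolution, which is consistent with the abstract's emphasis on a small parameter and MAC budget.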
| Original language | English |
|---|---|
| Pages (from-to) | 116934-116943 |
| Number of pages | 10 |
| Journal | IEEE Access |
| Volume | 13 |
| DOIs | |
| State | Published - 2025 |
Keywords
- Deep learning
- attention mechanisms
- factorized convolution
- lightweight network
- speech enhancement