Unsupervised Representation Learning with Task-Agnostic Feature Masking for Robust End-to-End Speech Recognition

June Woo Kim, Hoon Chung, Ho Young Jung

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Unsupervised learning-based approaches for training speech vector representations (SVR) have recently been widely applied. While pretrained SVR models excel in relatively clean automatic speech recognition (ASR) tasks, such as recognizing speech recorded in laboratory environments, they are still insufficient for practical applications that involve various types of noise, intonation, and dialects. To cope with this problem, we present a novel unsupervised SVR learning method for practical end-to-end ASR models. Our approach involves designing a speech feature masking method to stabilize SVR model learning and improve the performance of the ASR model in a downstream task. By applying a noise masking strategy to diverse combinations of the time and frequency regions of the spectrogram, the SVR model becomes a robust representation extractor for the ASR model in practical scenarios. In pretraining experiments, we train the SVR model using approximately 18,000 h of Korean speech datasets that include diverse speakers and were recorded in environments with various amounts of noise. The weights of the pretrained SVR extractor are then frozen, and the extracted speech representations are used for ASR model training in a downstream task. The experimental results show that the ASR model using our proposed SVR extractor significantly outperforms conventional methods.
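
The abstract does not specify how the masking is implemented, so the following is only a minimal sketch of the general idea of replacing random time and frequency bands of a spectrogram with noise. The function name `mask_spectrogram`, every parameter and default value, and the choice of Gaussian noise as the fill are assumptions made for illustration, not the authors' method. Plain Python lists are used to keep the example dependency-free; a real pipeline would more likely operate on arrays or tensors.

```python
import random


def mask_spectrogram(spec, rng, n_time_masks=2, n_freq_masks=2,
                     max_time_width=20, max_freq_width=8, noise_std=1.0):
    """Return (masked, mask) for a spectrogram given as spec[freq][frame].

    Hypothetical sketch: random time bands and frequency bands are chosen,
    and every cell inside any band is replaced by Gaussian noise.
    The input spectrogram is not modified.
    """
    n_freq = len(spec)
    n_frames = len(spec[0])
    mask = [[False] * n_frames for _ in range(n_freq)]

    # Time masks: blank out a run of consecutive frames across all bins.
    for _ in range(n_time_masks):
        width = rng.randint(1, min(max_time_width, n_frames))
        start = rng.randint(0, n_frames - width)
        for f in range(n_freq):
            for t in range(start, start + width):
                mask[f][t] = True

    # Frequency masks: blank out a run of consecutive bins across all frames.
    for _ in range(n_freq_masks):
        width = rng.randint(1, min(max_freq_width, n_freq))
        start = rng.randint(0, n_freq - width)
        for f in range(start, start + width):
            mask[f] = [True] * n_frames

    # Fill masked cells with noise (an assumed choice); copy the rest.
    masked = [[rng.gauss(0.0, noise_std) if mask[f][t] else spec[f][t]
               for t in range(n_frames)]
              for f in range(n_freq)]
    return masked, mask


# Usage: an 80-bin, 100-frame dummy spectrogram of zeros.
rng = random.Random(0)
spec = [[0.0] * 100 for _ in range(80)]
masked, mask = mask_spectrogram(spec, rng)
```

Because time and frequency bands are drawn independently, each call yields a different combination of masked regions, which is one plausible reading of "diverse combinations of the time and frequency regions" in the abstract.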

Original language: English
Article number: 622
Journal: Mathematics
Volume: 11
Issue number: 3
State: Published - Feb 2023

Keywords

  • deep learning
  • feature representation extractor
  • neural network
  • representation learning
  • speech processing
  • speech recognition
  • speech vector representation
  • unsupervised learning
