Vocoder-free End-to-End Voice Conversion with Transformer Network

June Woo Kim, Ho Young Jung, Minho Lee

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

4 Scopus citations

Abstract

Mel-frequency filter bank (MFB) based approaches have the advantage of faster learning than approaches that use the raw spectrum, because they work with a smaller number of features. However, speech generators with the MFB approach require an additional computationally expensive vocoder for the training process. The pre- and post-processing needed by the MFB and the vocoder is not essential for converting human voices, because it is possible to use only the raw spectrum to generate different styles of voices with clear pronunciation. In this paper, we introduce a vocoder-free end-to-end voice conversion method using a transformer network to alleviate the computational burden of the additional pre- and post-processing. Our transformer-based architecture, which does not have any CNN or RNN layers, learns quickly while overcoming the sequential-computation limitation of conventional RNNs. For this reason, our model is a fast and effective approach for converting realistic voices using raw spectra in a parallel manner, generating different styles of voices with clear pronunciation. Furthermore, we can obtain an adapted MFB for speech recognition by multiplying the converted magnitude with the phase information, and therefore our conversion model is also suitable for speaker adaptation. We perform our voice conversion experiments on the TIDIGITS dataset, using naturalness, similarity, and clarity with Mean Opinion Score as metrics.
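The step of multiplying a converted magnitude with the phase information can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dft`, `idft`, `recombine`, `convert_frame`) and the `convert_magnitude` callback, which stands in for the transformer network, are hypothetical, and a single-frame DFT is used here in place of a windowed short-time transform for brevity.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of one real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def recombine(magnitude, phase):
    """Multiply each magnitude by its unit phase factor exp(j * phase)."""
    return [m * cmath.exp(1j * p) for m, p in zip(magnitude, phase)]


def convert_frame(frame, convert_magnitude):
    """Convert one frame without a vocoder.

    convert_magnitude is a hypothetical placeholder for the conversion
    model: it maps a list of magnitudes to a list of the same length.
    The phase of the input frame is reused unchanged.
    """
    spectrum = dft(frame)
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    converted = convert_magnitude(magnitude)
    return idft(recombine(converted, phase))


if __name__ == "__main__":
    frame = [math.sin(2 * math.pi * 3 * t / 16) for t in range(16)]
    # An identity "conversion" should reproduce the input frame.
    restored = convert_frame(frame, lambda mags: mags)
    print(max(abs(a - b) for a, b in zip(frame, restored)))
```

With an identity callback the frame is recovered up to floating-point error, which shows that magnitude and phase together carry the full signal. A practical system would presumably operate on overlapping windowed frames and use an FFT library rather than this quadratic-time transform.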

Original language: English
Title of host publication: 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728169262
State: Published - Jul 2020
Event: 2020 International Joint Conference on Neural Networks, IJCNN 2020 - Virtual, Glasgow, United Kingdom
Duration: 19 Jul 2020 to 24 Jul 2020

Publication series

Name: Proceedings of the International Joint Conference on Neural Networks

Conference

Conference: 2020 International Joint Conference on Neural Networks, IJCNN 2020
Country/Territory: United Kingdom
City: Virtual, Glasgow
Period: 19/07/20 to 24/07/20

Keywords

  • phase
  • spectrum
  • transformer
  • vocoder-free
  • voice conversion
