Speech-driven talking face using embedded confusable system for real time mobile multimedia

Po Yi Shih, Anand Paul, Jhing Fa Wang, Yi Hung Chen

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

This paper presents a real-time speech-driven talking face system that offers low computational complexity and a smooth visual impression. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table: viseme similarity is first estimated with a histogram distance, and phonemes are then grouped with the Houtgast similarity approach, exploiting the fact that many visemes are visually ambiguous. The resulting mapping table simplifies the mapping problem and improves viseme classification accuracy. The implemented real-time speech-driven talking face system comprises: 1) speech signal processing, including SNR-aware speech enhancement for noise reduction and ICA-based feature set extraction for robust acoustic feature vectors; 2) recognition network processing, in which an HMM and an MCSVM are combined into a recognition network for phoneme recognition and viseme classification; the HMM handles sequential inputs well, while the MCSVM classifies with good generalization properties, especially for limited samples. The phoneme-viseme mapping table lets the MCSVM decide which viseme class the HMM's observation sequence belongs to; and 3) visual processing, which arranges the lip-shape images of the visemes in time sequence and increases realism through dynamic alpha blending with varying alpha values. In the experiments, the speech signal processing stage, evaluated on noisy versus clean speech, reduced the phoneme error rate (PER) from 16.7% to 15.6% (a 1.1% absolute improvement) and the word error rate (WER) from 35.2% to 30.4% (4.8% absolute). For viseme classification, the error rate decreased from 19.22% to 9.37%. Finally, we simulated GSM communication between a mobile phone and a PC and rated visual quality and the speech-driven feeling using the mean opinion score. Overall, our method reduces the number of visemes and lip-shape images via confusable sets and enables real-time operation.
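To make the confusable-set construction concrete, here is a minimal sketch, assuming grey-level histograms as the viseme descriptor and an L1 histogram distance, with a greedy merge standing in for the Houtgast similarity grouping; the bin count and merge threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def viseme_histogram(lip_image, bins=32):
    """Grey-level histogram of one lip-shape image, normalised to sum to 1."""
    hist, _ = np.histogram(lip_image.ravel(), bins=bins, range=(0, 255))
    return hist / max(hist.sum(), 1)

def histogram_distance(h1, h2):
    """L1 (city-block) distance between two normalised histograms."""
    return np.abs(h1 - h2).sum()

def group_confusable_visemes(histograms, threshold=0.25):
    """Greedily merge visemes whose pairwise histogram distance stays
    below `threshold` into one confusable set; each set becomes a single
    entry of the phoneme-viseme mapping table."""
    groups = []
    for idx in range(len(histograms)):
        for g in groups:
            if all(histogram_distance(histograms[idx], histograms[j]) < threshold
                   for j in g):
                g.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Example: eight synthetic 64x64 grey lip images -> confusable sets
rng = np.random.default_rng(0)
images = [rng.integers(0, 256, size=(64, 64)) for _ in range(8)]
print(group_confusable_visemes([viseme_histogram(img) for img in images]))
```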
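The recognition cascade in step 2 could be outlined as below. The HMM front end is assumed to be external (only its phoneme labels appear here), the mapping-table entries are hypothetical examples, and scikit-learn's SVC in its default one-vs-one mode stands in for the paper's MCSVM.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical phoneme-viseme mapping table produced by the confusable
# system: each phoneme points to the confusable set (viseme class) it
# belongs to. Real entries would come from the grouping step above.
PHONEME_TO_VISEME = {"p": 0, "b": 0, "m": 0, "f": 1, "v": 1, "a": 2}

class VisemeClassifier:
    """Second stage of the HMM -> MCSVM cascade: the (external) HMM
    yields phoneme labels for the observation sequence, the mapping
    table collapses them into viseme classes, and a one-vs-one SVM is
    trained to classify acoustic feature vectors into those classes."""

    def __init__(self):
        self.svm = SVC(kernel="rbf")  # multiclass via one-vs-one by default

    def fit(self, features, phoneme_labels):
        viseme_labels = [PHONEME_TO_VISEME[p] for p in phoneme_labels]
        self.svm.fit(features, viseme_labels)
        return self

    def classify(self, features):
        return self.svm.predict(features)

# Toy demo with random 13-dim feature vectors (MFCC-like, illustrative)
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 13))
clf = VisemeClassifier().fit(X, ["p", "b", "f", "v", "a", "m"])
print(clf.classify(X[:2]))
```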
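For step 3, the dynamic alpha blending between consecutive lip-shape frames might look like the following sketch; the linear alpha ramp and the number of in-between frames are assumptions for illustration, not settings taken from the paper.

```python
import numpy as np

def blend_lip_frames(prev_frame, next_frame, steps=4):
    """Generate `steps` in-between frames whose alpha ramps from 0 to 1,
    so the jump between two viseme lip shapes plays as a smooth
    transition instead of a hard cut."""
    frames = []
    for k in range(1, steps + 1):
        alpha = k / (steps + 1)                 # varying alpha value
        mixed = (1.0 - alpha) * prev_frame + alpha * next_frame
        frames.append(mixed.astype(prev_frame.dtype))
    return frames

# Example: interpolate between a dark and a bright 64x64 grey frame
a = np.zeros((64, 64), dtype=np.uint8)
b = np.full((64, 64), 255, dtype=np.uint8)
transition = blend_lip_frames(a, b, steps=3)
```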

Original language: English
Pages (from-to): 417-437
Number of pages: 21
Journal: Multimedia Tools and Applications
Volume: 73
Issue number: 1
DOIs
State: Published - 17 Sep 2014

Keywords

  • Confusion matrix
  • Hidden Markov model (HMM)
  • Lip-synch
  • Multiclass support vector machine (MCSVM)
  • Real-time speech driven
  • Talking face
  • Viseme histogram similarity
