TY - JOUR
T1 - Speech-driven talking face using embedded confusable system for real time mobile multimedia
AU - Shih, Po Yi
AU - Paul, Anand
AU - Wang, Jhing Fa
AU - Chen, Yi Hung
N1 - Publisher Copyright:
© 2013, Springer Science+Business Media New York.
PY - 2014/9/17
Y1 - 2014/9/17
N2 - This paper presents a real-time speech-driven talking face system which provides low computational complexity and smoothly visual sense. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table which is constructed by phoneme grouping using Houtgast similarity approach based on the results of viseme similarity estimation using histogram distance, according to the concept of viseme visually ambiguous. The generated mapping table can simplify the mapping problem and promote viseme classification accuracy. The implemented real time speech-driven talking face system includes: 1) speech signal processing, including SNR-aware speech enhancement for noise reduction and ICA-based feature set extractions for robust acoustic feature vectors; 2) recognition network processing, HMM and MCSVM are combined as a recognition network approach for phoneme recognition and viseme classification, which HMM is good at dealing with sequential inputs, while MCSVM shows superior performance in classifying with good generalization properties, especially for limited samples. The phoneme-viseme mapping table is used for MCSVM to classify the observation sequence of HMM results, which the viseme class is belong to; 3) visual processing, arranges lip shape image of visemes in time sequence, and presents more authenticity using a dynamic alpha blending with different alpha value settings. Presented by the experiments, the used speech signal processing with noise speech comparing with clean speech, could gain 1.1 % (16.7 % to 15.6 %) and 4.8 % (30.4 % to 35.2 %) accuracy rate improvements in PER and WER, respectively. For viseme classification, the error rate is decreased from 19.22 % to 9.37 %. Last, we simulated a GSM communication between mobile phone and PC for visual quality rating and speech driven feeling using mean opinion score. Therefore, our method reduces the number of visemes and lip shape images by confusable sets and enables real-time operation.
AB - This paper presents a real-time speech-driven talking face system which provides low computational complexity and smoothly visual sense. A novel embedded confusable system is proposed to generate an efficient phoneme-viseme mapping table which is constructed by phoneme grouping using Houtgast similarity approach based on the results of viseme similarity estimation using histogram distance, according to the concept of viseme visually ambiguous. The generated mapping table can simplify the mapping problem and promote viseme classification accuracy. The implemented real time speech-driven talking face system includes: 1) speech signal processing, including SNR-aware speech enhancement for noise reduction and ICA-based feature set extractions for robust acoustic feature vectors; 2) recognition network processing, HMM and MCSVM are combined as a recognition network approach for phoneme recognition and viseme classification, which HMM is good at dealing with sequential inputs, while MCSVM shows superior performance in classifying with good generalization properties, especially for limited samples. The phoneme-viseme mapping table is used for MCSVM to classify the observation sequence of HMM results, which the viseme class is belong to; 3) visual processing, arranges lip shape image of visemes in time sequence, and presents more authenticity using a dynamic alpha blending with different alpha value settings. Presented by the experiments, the used speech signal processing with noise speech comparing with clean speech, could gain 1.1 % (16.7 % to 15.6 %) and 4.8 % (30.4 % to 35.2 %) accuracy rate improvements in PER and WER, respectively. For viseme classification, the error rate is decreased from 19.22 % to 9.37 %. Last, we simulated a GSM communication between mobile phone and PC for visual quality rating and speech driven feeling using mean opinion score. Therefore, our method reduces the number of visemes and lip shape images by confusable sets and enables real-time operation.
KW - Confusion matrix
KW - Hidden markov model (HMM)
KW - Lip-synch
KW - Multiclass support vector machine (MCSVM)
KW - Real-time speech driven
KW - Talking face
KW - Viseme histogram similarity
UR - http://www.scopus.com/inward/record.url?scp=84919389787&partnerID=8YFLogxK
U2 - 10.1007/s11042-013-1609-3
DO - 10.1007/s11042-013-1609-3
M3 - Article
AN - SCOPUS:84919389787
SN - 1380-7501
VL - 73
SP - 417
EP - 437
JO - Multimedia Tools and Applications
JF - Multimedia Tools and Applications
IS - 1
ER -