TY - GEN
T1 - Speaker dependent visual speech recognition by symbol and real value assignment
AU - Ju, Jeongwoo
AU - Jung, Heechul
AU - Kim, Junmo
PY - 2013
Y1 - 2013
N2 - In this paper, we propose a visual speech recognition method using symbol or real value assignment. Our method is inspired by the Bag of Words (BoW) [1] model, which is usually applied to object matching problems. In the BoW model, a codebook is produced by K-means clustering, and a feature vector extracted from an image is converted to the corresponding symbol. Similarly, we generate a codebook by running the K-means algorithm on a pool of pHog (Pyramid Histogram of Oriented Gradients) feature vectors extracted from a subset of a lip database. Then, each remaining lip image is assigned a value based on its chi-square distance to each cluster. Depending on the type of this value, we suggest two methods for assigning it to a lip image frame. The first method finds the cluster containing the element image with the minimum chi-square distance to the frame being processed and assigns that cluster's label to the frame. The second method calculates the distances between the frame and all cluster centroids, yielding a multi-dimensional vector that directly becomes the value assigned to the frame. Following these methods, each time sequence is converted into a symbolized or multi-dimensional real-valued sequence. To measure the similarity between two time sequences, we use Dynamic Time Warping for real-valued sequences and edit distance for symbolized sequences.
AB - In this paper, we propose a visual speech recognition method using symbol or real value assignment. Our method is inspired by the Bag of Words (BoW) [1] model, which is usually applied to object matching problems. In the BoW model, a codebook is produced by K-means clustering, and a feature vector extracted from an image is converted to the corresponding symbol. Similarly, we generate a codebook by running the K-means algorithm on a pool of pHog (Pyramid Histogram of Oriented Gradients) feature vectors extracted from a subset of a lip database. Then, each remaining lip image is assigned a value based on its chi-square distance to each cluster. Depending on the type of this value, we suggest two methods for assigning it to a lip image frame. The first method finds the cluster containing the element image with the minimum chi-square distance to the frame being processed and assigns that cluster's label to the frame. The second method calculates the distances between the frame and all cluster centroids, yielding a multi-dimensional vector that directly becomes the value assigned to the frame. Following these methods, each time sequence is converted into a symbolized or multi-dimensional real-valued sequence. To measure the similarity between two time sequences, we use Dynamic Time Warping for real-valued sequences and edit distance for symbolized sequences.
KW - Codebook
KW - Dynamic Time Warping
KW - Edit distance
KW - pHog
KW - Visual Speech Recognition
UR - http://www.scopus.com/inward/record.url?scp=84876225154&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-37374-9_98
DO - 10.1007/978-3-642-37374-9_98
M3 - Conference contribution
AN - SCOPUS:84876225154
SN - 9783642373732
T3 - Advances in Intelligent Systems and Computing
SP - 1015
EP - 1022
BT - An Edition of the Presented Papers from the 1st International Conference on Robot Intelligence Technology and Applications
PB - Springer Verlag
T2 - 1st International Conference on Robot Intelligence Technology and Applications, RiTA 2012
Y2 - 16 December 2012 through 18 December 2012
ER -