TY - GEN
T1 - Forced-alignment and edit-distance scoring for vocabulary tutoring applications
AU - Pakhomov, Serguei
AU - Richardson, Jayson
AU - Finholt-Daniel, Matt
AU - Sales, Gregory
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2008
Y1 - 2008
N2 - We demonstrate an application of Automatic Speech Recognition (ASR) technology to the assessment of young children's basic English vocabulary. We use a test set of 2935 speech samples manually rated by 3 reviewers to compare several approaches to measuring and classifying the accuracy of the children's pronunciation of words, including acoustic confidence scoring obtained by forced alignment and edit distance between the expected and actual ASR output. We show that phoneme-level language modeling can be used to obtain good classification results even with a relatively small amount of acoustic training data. The area under the ROC curve of the ASR-based classifier that uses a bi-phone language model interpolated with a general English bi-phone model is 0.80 (95% CI 0.78-0.82). The point where both sensitivity and specificity are at their maximum is where sensitivity is 0.74 and the specificity is 0.80 with 0.77 harmonic mean, which is comparable to human performance (ICC=0.75; absolute agreement = 81%).
KW - Automatic speech recognition
KW - Sub-word language modeling
KW - Vocabulary tutor
UR - http://www.scopus.com/inward/record.url?scp=53049110386&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=53049110386&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-87391-4_57
DO - 10.1007/978-3-540-87391-4_57
M3 - Conference contribution
AN - SCOPUS:53049110386
SN - 3540873902
SN - 9783540873907
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 443
EP - 450
BT - Text, Speech and Dialogue - 11th International Conference, TSD 2008, Proceedings
T2 - 11th International Conference on Text, Speech and Dialogue, TSD 2008
Y2 - 8 September 2008 through 12 September 2008
ER -