Forced-alignment and edit-distance scoring for vocabulary tutoring applications

Serguei Pakhomov; Jayson Richardson; Matt Finholt-Daniel; Gregory Sales

doi:10.1007/978-3-540-87391-4_57

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

We demonstrate an application of Automatic Speech Recognition (ASR) technology to the assessment of young children's basic English vocabulary. We use a test set of 2935 speech samples, manually rated by three reviewers, to compare several approaches to measuring and classifying the accuracy of the children's pronunciation of words, including acoustic confidence scoring obtained by forced alignment and edit distance between the expected and actual ASR output. We show that phoneme-level language modeling can be used to obtain good classification results even with a relatively small amount of acoustic training data. The ASR-based classifier that uses a bi-phone language model interpolated with a general English bi-phone model achieves an area under the ROC curve of 0.80 (95% CI 0.78-0.82). At the operating point that jointly maximizes sensitivity and specificity, sensitivity is 0.74 and specificity is 0.80 (harmonic mean 0.77), which is comparable to human performance (ICC = 0.75; absolute agreement = 81%).
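
The edit-distance measure mentioned in the abstract can be illustrated with a minimal sketch. It assumes unit costs for insertion, deletion, and substitution, and compares an expected phoneme sequence with a recognized one. The abstract does not state the phoneme inventory, the cost scheme, or any normalization, so the ARPAbet-style symbols, the function names, and the example word are hypothetical. The second helper only reproduces the harmonic mean of the reported sensitivity and specificity.

```python
from typing import Sequence


def edit_distance(expected: Sequence[str], actual: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences.

    Assumption: every insertion, deletion, and substitution costs 1.
    """
    # prev[j] holds the distance between the processed prefix of
    # `expected` and the first j symbols of `actual`.
    prev = list(range(len(actual) + 1))
    for i, exp_phone in enumerate(expected, start=1):
        curr = [i]
        for j, act_phone in enumerate(actual, start=1):
            cost = 0 if exp_phone == act_phone else 1
            curr.append(min(
                prev[j] + 1,         # delete a symbol of `expected`
                curr[j - 1] + 1,     # insert a symbol of `actual`
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two non-negative rates (0.0 if both are zero)."""
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0


# Hypothetical example: the target word "cat" against a recognized
# phoneme string with one substitution and one extra final phoneme.
expected_phones = ["K", "AE", "T"]
recognized_phones = ["K", "AH", "T", "S"]
print(edit_distance(expected_phones, recognized_phones))

# Sensitivity and specificity as reported in the abstract.
print(round(harmonic_mean(0.74, 0.80), 2))
```

In a classifier of this kind, a sample would be accepted or rejected by comparing the distance with a threshold. Sweeping that threshold is what traces out an ROC curve.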

Original language	English (US)
Title of host publication	Text, Speech and Dialogue - 11th International Conference, TSD 2008, Proceedings
Pages	443-450
Number of pages	8
DOIs	https://doi.org/10.1007/978-3-540-87391-4_57
State	Published - 2008
Event	11th International Conference on Text, Speech and Dialogue, TSD 2008 - Brno, Czech Republic
Duration	Sep 8 2008 → Sep 12 2008

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	5246 LNAI
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	11th International Conference on Text, Speech and Dialogue, TSD 2008
Country/Territory	Czech Republic
City	Brno
Period	9/8/08 → 9/12/08

Keywords

Automatic speech recognition
Sub-word language modeling
Vocabulary tutor

Cite this

Pakhomov, S., Richardson, J., Finholt-Daniel, M., & Sales, G. (2008). Forced-alignment and edit-distance scoring for vocabulary tutoring applications. In Text, Speech and Dialogue - 11th International Conference, TSD 2008, Proceedings (pp. 443-450). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5246 LNAI). https://doi.org/10.1007/978-3-540-87391-4_57

@inproceedings{cb82bd84f6254a09917fc2cbc2c7a0c0,

title = "Forced-alignment and edit-distance scoring for vocabulary tutoring applications",

keywords = "Automatic speech recognition, Sub-word language modeling, Vocabulary tutor",

author = "Serguei Pakhomov and Jayson Richardson and Matt Finholt-Daniel and Gregory Sales",

year = "2008",

doi = "10.1007/978-3-540-87391-4_57",

language = "English (US)",

isbn = "3540873902",

series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",

pages = "443--450",

booktitle = "Text, Speech and Dialogue - 11th International Conference, TSD 2008, Proceedings",

}

TY - GEN

T1 - Forced-alignment and edit-distance scoring for vocabulary tutoring applications

AU - Pakhomov, Serguei

AU - Richardson, Jayson

AU - Finholt-Daniel, Matt

AU - Sales, Gregory

PY - 2008

Y1 - 2008

KW - Automatic speech recognition

KW - Sub-word language modeling

KW - Vocabulary tutor

UR - http://www.scopus.com/inward/record.url?scp=53049110386&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=53049110386&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-87391-4_57

DO - 10.1007/978-3-540-87391-4_57

M3 - Conference contribution

AN - SCOPUS:53049110386

SN - 3540873902

SN - 9783540873907

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 443

EP - 450

BT - Text, Speech and Dialogue - 11th International Conference, TSD 2008, Proceedings

T2 - 11th International Conference on Text, Speech and Dialogue, TSD 2008

Y2 - 8 September 2008 through 12 September 2008

ER -
