Towards Comprehensive Clinical Abbreviation Disambiguation Using Machine-Labeled Training Data

Gregory P Finley; Serguei V.S. Pakhomov; Reed McEwan; Genevieve B. Melton

Research output: Contribution to journal › Article › peer-review

Abstract

Abbreviation disambiguation in clinical texts is a problem handled well by fully supervised machine learning methods. Acquiring training data, however, is expensive and would be impractical for large numbers of abbreviations in specialized corpora. An alternative is a semi-supervised approach, in which training data are automatically generated by substituting long forms in natural text with their corresponding abbreviations. Most prior implementations of this method either focus on very few abbreviations or do not test on real-world data. We present a realistic use case by testing several semi-supervised classification algorithms on a large hand-annotated medical record of occurrences of 74 ambiguous abbreviations. Despite notable differences between training and test corpora, classifiers achieve up to 90% accuracy. Our tests demonstrate that semi-supervised abbreviation disambiguation is a viable and extensible option for medical NLP systems.
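The substitution step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the sense inventory `SENSES`, the function name `make_training_examples`, the `window` parameter, and the sample sentence are all hypothetical, chosen only to show how replacing a long form with its abbreviation yields a labeled training example without hand annotation. No classifier is trained here.

```python
import re

# Hypothetical sense inventory (illustrative only, not taken from the paper):
# each abbreviation maps to the long forms it may stand for.
SENSES = {
    "RA": ["rheumatoid arthritis", "right atrium"],
}


def make_training_examples(text, senses, window=5):
    """Find long forms in text, swap in the abbreviation, keep the long form as the label."""
    examples = []
    for abbr, long_forms in senses.items():
        for long_form in long_forms:
            pattern = re.compile(r"\b" + re.escape(long_form) + r"\b", re.IGNORECASE)
            for match in pattern.finditer(text):
                # Keep up to `window` whitespace-separated tokens on each side as context.
                left = text[:match.start()].split()[-window:]
                right = text[match.end():].split()[:window]
                examples.append({
                    "abbreviation": abbr,
                    "label": long_form,
                    "context": " ".join(left + [abbr] + right),
                })
    return examples


sample = ("Patient has a history of rheumatoid arthritis. "
          "Echo shows a dilated right atrium.")
examples = make_training_examples(sample, SENSES)
for example in examples:
    print(example)
```

Each resulting context contains the ambiguous abbreviation in place of the long form, and the replaced long form serves as the machine-generated label that a classifier could later be trained on.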

Original language	English (US)
Pages (from-to)	560-569
Number of pages	10
Journal	AMIA ... Annual Symposium proceedings. AMIA Symposium
Volume	2016
State	Published - 2016

Cite this

@article{e789cc427acb444e82fec01eb7b366d5,

title = "Towards Comprehensive Clinical Abbreviation Disambiguation Using Machine-Labeled Training Data",

abstract = "Abbreviation disambiguation in clinical texts is a problem handled well by fully supervised machine learning methods. Acquiring training data, however, is expensive and would be impractical for large numbers of abbreviations in specialized corpora. An alternative is a semi-supervised approach, in which training data are automatically generated by substituting long forms in natural text with their corresponding abbreviations. Most prior implementations of this method either focus on very few abbreviations or do not test on real-world data. We present a realistic use case by testing several semi-supervised classification algorithms on a large hand-annotated medical record of occurrences of 74 ambiguous abbreviations. Despite notable differences between training and test corpora, classifiers achieve up to 90% accuracy. Our tests demonstrate that semi-supervised abbreviation disambiguation is a viable and extensible option for medical NLP systems.",

author = "Finley, {Gregory P} and Pakhomov, {Serguei V.S.} and McEwan, Reed and Melton, {Genevieve B.}",

year = "2016",

language = "English (US)",

volume = "2016",

pages = "560--569",

journal = "AMIA ... Annual Symposium proceedings. AMIA Symposium",

issn = "1559-4076",

publisher = "American Medical Informatics Association",

}

TY - JOUR

T1 - Towards Comprehensive Clinical Abbreviation Disambiguation Using Machine-Labeled Training Data

AU - Finley, Gregory P

AU - Pakhomov, Serguei V.S.

AU - McEwan, Reed

AU - Melton, Genevieve B.

PY - 2016

Y1 - 2016

AB - Abbreviation disambiguation in clinical texts is a problem handled well by fully supervised machine learning methods. Acquiring training data, however, is expensive and would be impractical for large numbers of abbreviations in specialized corpora. An alternative is a semi-supervised approach, in which training data are automatically generated by substituting long forms in natural text with their corresponding abbreviations. Most prior implementations of this method either focus on very few abbreviations or do not test on real-world data. We present a realistic use case by testing several semi-supervised classification algorithms on a large hand-annotated medical record of occurrences of 74 ambiguous abbreviations. Despite notable differences between training and test corpora, classifiers achieve up to 90% accuracy. Our tests demonstrate that semi-supervised abbreviation disambiguation is a viable and extensible option for medical NLP systems.

UR - http://www.scopus.com/inward/record.url?scp=85026685803&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85026685803&partnerID=8YFLogxK

M3 - Article

C2 - 28269852

AN - SCOPUS:85026685803

SN - 1559-4076

VL - 2016

SP - 560

EP - 569

JO - AMIA ... Annual Symposium proceedings. AMIA Symposium

JF - AMIA ... Annual Symposium proceedings. AMIA Symposium

ER -
