Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports

Sobia Nasir Laique; Umar Hayat; Shashank Sarvepalli; Byron Vaughn; Mounir Ibrahim; John McMichael; Kanza Noor Qaiser; Carol Burke; Amit Bhatt; Colin Rhodes; Maged K. Rizk

doi:10.1016/j.gie.2020.08.038

Medicine - Gastro, Hepatology, Nutrition Division

Research output: Contribution to journal › Article › peer-review

26 Scopus citations

Abstract

Background and Aims: Colonoscopy is commonly performed for colorectal cancer screening in the United States. Reports are often generated in a non-standardized format and are not always integrated into electronic health records. Thus, this information is not readily available for streamlining quality management, participating in endoscopy registries, or reporting of patient- and center-specific risk factors predictive of outcomes. We aim to demonstrate the use of a new hybrid approach using natural language processing of charts that have been elucidated with optical character recognition processing (OCR/NLP hybrid) to obtain relevant clinical information from scanned colonoscopy and pathology reports, a technology co-developed by Cleveland Clinic and eHealth Technologies (West Henrietta, NY, USA).

Methods: This was a retrospective study conducted at Cleveland Clinic, Cleveland, Ohio, and the University of Minnesota, Minneapolis, Minnesota. A randomly sampled list of outpatient screening colonoscopy procedures and pathology reports was selected. Desired variables were then collected. Two researchers first manually reviewed the reports for the desired variables. Then, the OCR/NLP algorithm was used to obtain the same variables from 3 electronic health records in use at our institution: Epic (Verona, Wisc, USA), ProVation (Minneapolis, Minn, USA) used for endoscopy reporting, and Sunquest PowerPath (Tucson, Ariz, USA) used for pathology reporting.

Results: Compared with manual data extraction, the accuracy of the hybrid OCR/NLP approach to detect polyps was 95.8%, adenomas 98.5%, sessile serrated polyps 99.3%, advanced adenomas 98%, inadequate bowel preparation 98.4%, and failed cecal intubation 99%. Comparison of the dataset collected via NLP alone with that collected using the hybrid OCR/NLP approach showed that the accuracy for almost all variables was >99%.

Conclusions: Our study is the first to validate the use of a unique hybrid OCR/NLP technology to extract desired variables from scanned procedure and pathology reports contained in image format with an accuracy >95%.
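
The abstract reports accuracy per variable, with manual review as the reference. As a rough illustration of what such a comparison might look like, the sketch below computes the fraction of reports on which an automated extraction agrees with manual review for each variable. It is not the authors' pipeline: the function name `per_variable_accuracy`, the variable names, and the toy records are all hypothetical, and the study's exact accuracy definition is not given in the abstract.

```python
from typing import Dict, List

Record = Dict[str, bool]  # hypothetical layout: one dict of yes/no variables per report


def per_variable_accuracy(manual: List[Record], extracted: List[Record]) -> Dict[str, float]:
    """Return, for each variable, the share of reports where the extracted
    value matches the manually reviewed value (manual review is the reference)."""
    if len(manual) != len(extracted):
        raise ValueError("manual and extracted must cover the same reports")
    if not manual:
        return {}
    accuracy: Dict[str, float] = {}
    for variable in manual[0]:
        # Count reports where the two sources agree on this variable.
        matches = sum(1 for m, e in zip(manual, extracted) if m[variable] == e[variable])
        accuracy[variable] = matches / len(manual)
    return accuracy


# Toy data, invented for illustration only (not study data).
manual_review = [
    {"polyp": True, "adenoma": False},
    {"polyp": False, "adenoma": False},
    {"polyp": True, "adenoma": True},
    {"polyp": True, "adenoma": False},
]
algorithm_output = [
    {"polyp": False, "adenoma": False},  # disagrees with manual review on "polyp"
    {"polyp": False, "adenoma": False},
    {"polyp": True, "adenoma": True},
    {"polyp": True, "adenoma": False},
]

result = per_variable_accuracy(manual_review, algorithm_output)
```

With the toy records above, the two sources agree on "polyp" for three of four reports and on "adenoma" for all four.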

Original language	English (US)
Pages (from-to)	750-757
Number of pages	8
Journal	Gastrointestinal endoscopy
Volume	93
Issue number	3
DOIs	https://doi.org/10.1016/j.gie.2020.08.038
State	Published - Mar 2021

Bibliographical note

Publisher Copyright:
© 2021 American Society for Gastrointestinal Endoscopy

Cite this

Laique, S. N., Hayat, U., Sarvepalli, S., Vaughn, B., Ibrahim, M., McMichael, J., Qaiser, K. N., Burke, C., Bhatt, A., Rhodes, C., & Rizk, M. K. (2021). Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports. Gastrointestinal endoscopy, 93(3), 750-757. https://doi.org/10.1016/j.gie.2020.08.038

Laique, SN, Hayat, U, Sarvepalli, S, Vaughn, B, Ibrahim, M, McMichael, J, Qaiser, KN, Burke, C, Bhatt, A, Rhodes, C & Rizk, MK 2021, 'Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports', Gastrointestinal endoscopy, vol. 93, no. 3, pp. 750-757. https://doi.org/10.1016/j.gie.2020.08.038

@article{e202b04783544f19abe3e1448c125749,

title = "Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports",

author = "Laique, {Sobia Nasir} and Umar Hayat and Shashank Sarvepalli and Byron Vaughn and Mounir Ibrahim and John McMichael and Qaiser, {Kanza Noor} and Carol Burke and Amit Bhatt and Colin Rhodes and Rizk, {Maged K.}",

note = "Publisher Copyright: {\textcopyright} 2021 American Society for Gastrointestinal Endoscopy",

year = "2021",

month = mar,

doi = "10.1016/j.gie.2020.08.038",

language = "English (US)",

volume = "93",

pages = "750--757",

journal = "Gastrointestinal endoscopy",

issn = "0016-5107",

publisher = "Mosby Inc.",

number = "3",

}

TY - JOUR

T1 - Application of optical character recognition with natural language processing for large-scale quality metric data extraction in colonoscopy reports

AU - Laique, Sobia Nasir

AU - Hayat, Umar

AU - Sarvepalli, Shashank

AU - Vaughn, Byron

AU - Ibrahim, Mounir

AU - McMichael, John

AU - Qaiser, Kanza Noor

AU - Burke, Carol

AU - Bhatt, Amit

AU - Rhodes, Colin

AU - Rizk, Maged K.

PY - 2021/3

Y1 - 2021/3

UR - http://www.scopus.com/inward/record.url?scp=85096376341&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85096376341&partnerID=8YFLogxK

U2 - 10.1016/j.gie.2020.08.038

DO - 10.1016/j.gie.2020.08.038

M3 - Article

C2 - 32891620

AN - SCOPUS:85096376341

SN - 0016-5107

VL - 93

SP - 750

EP - 757

JO - Gastrointestinal endoscopy

JF - Gastrointestinal endoscopy

IS - 3

ER -
