Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting

David M Vock; Julian Wolfson; Sunayan Bandyopadhyay; Gediminas Adomavicius; Paul E. Johnson; Gabriela Vazquez-Benitez; Patrick J. O'Connor

doi:10.1016/j.jbi.2016.03.009

Research output: Contribution to journal › Article › peer-review

80 Scopus citations

Abstract

Models for predicting the probability of experiencing various health outcomes or adverse events over a certain time frame (e.g., having a heart attack in the next 5 years) based on individual patient characteristics are important tools for managing patient care. Electronic health data (EHD) are appealing sources of training data because they provide access to large amounts of rich individual-level data from present-day patient populations. However, because EHD are derived by extracting information from administrative and clinical databases, some fraction of subjects will not be under observation for the entire time frame over which one wants to make predictions; this loss to follow-up is often due to disenrollment from the health system. For subjects without complete follow-up, whether or not they experienced the adverse event is unknown, and in statistical terms the event time is said to be right-censored. Most machine learning approaches to the problem have been relatively ad hoc; for example, common approaches for handling observations in which the event status is unknown include (1) discarding those observations, (2) treating them as non-events, (3) splitting those observations into two observations: one where the event occurs and one where the event does not. In this paper, we present a general-purpose approach to account for right-censored outcomes using inverse probability of censoring weighting (IPCW). We illustrate how IPCW can easily be incorporated into a number of existing machine learning algorithms used to mine big health care data including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. We then show that our approach leads to better calibrated predictions than the three ad hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event, using EHD from a large U.S. Midwestern healthcare system.
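The weighting idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. It assumes the censoring distribution is estimated with a Kaplan-Meier estimator that treats censoring as the "event", and the function names (`censoring_survival`, `ipcw_weights`), the toy data, and the horizon `tau` are hypothetical. Subjects whose status at the horizon is unknown get weight 0; every other subject is weighted by the inverse of the estimated probability of remaining uncensored up to the point where their status became known.

```python
import numpy as np


def censoring_survival(time, event):
    """Kaplan-Meier estimate of G(t) = P(C > t), treating censoring as the 'event'.

    Assumption for this sketch: no covariates enter the censoring model.
    """
    time = np.asarray(time, dtype=float)
    censored = np.asarray(event, dtype=int) == 0
    grid = np.unique(time[censored])  # distinct censoring times, sorted
    steps = [1.0]                     # G before the first censoring time
    for u in grid:
        at_risk = np.sum(time >= u)
        n_cens = np.sum((time == u) & censored)
        steps.append(steps[-1] * (1.0 - n_cens / at_risk))
    steps = np.array(steps)

    def G(t):
        # Right-continuous step function: count censoring times <= t.
        idx = np.searchsorted(grid, t, side="right")
        return steps[idx]

    return G


def ipcw_weights(time, event, tau):
    """Return (label, weight) for the binary outcome 'event by time tau'."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    G = censoring_survival(time, event)

    had_event = (event == 1) & (time <= tau)
    known = had_event | (time > tau)  # status at tau is observed
    label = had_event.astype(int)

    weight = np.zeros_like(time)      # censored before tau -> weight 0
    weight[known] = 1.0 / G(np.minimum(time[known], tau))
    return label, weight


# Toy data (hypothetical): follow-up time in years, 1 = event, 0 = censored.
time = [2, 3, 4, 6, 7, 8]
event = [1, 0, 1, 0, 1, 0]
label, weight = ipcw_weights(time, event, tau=5)

# Weighted estimate of the 5-year risk; roughly 0.375 for this toy data.
risk = np.sum(weight * label) / np.sum(weight)
```

The resulting `weight` vector can be passed to any learner that accepts per-observation weights, which is what makes the approach general-purpose. In this toy example the weighted risk agrees with one minus the Kaplan-Meier survival estimate at 5 years.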

Original language	English (US)
Pages (from-to)	119-131
Number of pages	13
Journal	Journal of Biomedical Informatics
Volume	61
DOIs	https://doi.org/10.1016/j.jbi.2016.03.009
State	Published - Jun 1 2016

Bibliographical note

Funding Information:
This work was partially supported by NHLBI grant R01HL102144-01, NIH grant UL1TR000114, and AHRQ grant R21HS017622-01.

Publisher Copyright:
© 2016 Elsevier Inc.

Keywords

Censored data
Electronic health data
Inverse probability weighting
Machine learning
Risk prediction
Survival analysis

Cite this

Vock, D. M., Wolfson, J., Bandyopadhyay, S., Adomavicius, G., Johnson, P. E., Vazquez-Benitez, G., & O'Connor, P. J. (2016). Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting. Journal of Biomedical Informatics, 61, 119-131. https://doi.org/10.1016/j.jbi.2016.03.009

@article{e58e03ae13204b78b75087bed6f9bf7a,

title = "Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting",

abstract = "Models for predicting the probability of experiencing various health outcomes or adverse events over a certain time frame (e.g., having a heart attack in the next 5 years) based on individual patient characteristics are important tools for managing patient care. Electronic health data (EHD) are appealing sources of training data because they provide access to large amounts of rich individual-level data from present-day patient populations. However, because EHD are derived by extracting information from administrative and clinical databases, some fraction of subjects will not be under observation for the entire time frame over which one wants to make predictions; this loss to follow-up is often due to disenrollment from the health system. For subjects without complete follow-up, whether or not they experienced the adverse event is unknown, and in statistical terms the event time is said to be right-censored. Most machine learning approaches to the problem have been relatively ad hoc; for example, common approaches for handling observations in which the event status is unknown include (1) discarding those observations, (2) treating them as non-events, (3) splitting those observations into two observations: one where the event occurs and one where the event does not. In this paper, we present a general-purpose approach to account for right-censored outcomes using inverse probability of censoring weighting (IPCW). We illustrate how IPCW can easily be incorporated into a number of existing machine learning algorithms used to mine big health care data including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. We then show that our approach leads to better calibrated predictions than the three ad hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event, using EHD from a large U.S. Midwestern healthcare system.",

keywords = "Censored data, Electronic health data, Inverse probability weighting, Machine learning, Risk prediction, Survival analysis",

author = "Vock, {David M} and Julian Wolfson and Sunayan Bandyopadhyay and Gediminas Adomavicius and Johnson, {Paul E.} and Gabriela Vazquez-Benitez and O'Connor, {Patrick J.}",

note = "Funding Information: This work was partially supported by NHLBI grant R01HL102144-01, NIH grant UL1TR000114, and AHRQ grant R21HS017622-01. Publisher Copyright: {\textcopyright} 2016 Elsevier Inc.",

year = "2016",

month = jun,

day = "1",

doi = "10.1016/j.jbi.2016.03.009",

language = "English (US)",

volume = "61",

pages = "119--131",

journal = "Journal of Biomedical Informatics",

issn = "1532-0464",

publisher = "Academic Press Inc.",

}

TY - JOUR

T1 - Adapting machine learning techniques to censored time-to-event health record data

T2 - A general-purpose approach using inverse probability of censoring weighting

AU - Vock, David M

AU - Wolfson, Julian

AU - Bandyopadhyay, Sunayan

AU - Adomavicius, Gediminas

AU - Johnson, Paul E.

AU - Vazquez-Benitez, Gabriela

AU - O'Connor, Patrick J.

PY - 2016/6/1

Y1 - 2016/6/1

N2 - Models for predicting the probability of experiencing various health outcomes or adverse events over a certain time frame (e.g., having a heart attack in the next 5 years) based on individual patient characteristics are important tools for managing patient care. Electronic health data (EHD) are appealing sources of training data because they provide access to large amounts of rich individual-level data from present-day patient populations. However, because EHD are derived by extracting information from administrative and clinical databases, some fraction of subjects will not be under observation for the entire time frame over which one wants to make predictions; this loss to follow-up is often due to disenrollment from the health system. For subjects without complete follow-up, whether or not they experienced the adverse event is unknown, and in statistical terms the event time is said to be right-censored. Most machine learning approaches to the problem have been relatively ad hoc; for example, common approaches for handling observations in which the event status is unknown include (1) discarding those observations, (2) treating them as non-events, (3) splitting those observations into two observations: one where the event occurs and one where the event does not. In this paper, we present a general-purpose approach to account for right-censored outcomes using inverse probability of censoring weighting (IPCW). We illustrate how IPCW can easily be incorporated into a number of existing machine learning algorithms used to mine big health care data including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. We then show that our approach leads to better calibrated predictions than the three ad hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event, using EHD from a large U.S. Midwestern healthcare system.

AB - Models for predicting the probability of experiencing various health outcomes or adverse events over a certain time frame (e.g., having a heart attack in the next 5 years) based on individual patient characteristics are important tools for managing patient care. Electronic health data (EHD) are appealing sources of training data because they provide access to large amounts of rich individual-level data from present-day patient populations. However, because EHD are derived by extracting information from administrative and clinical databases, some fraction of subjects will not be under observation for the entire time frame over which one wants to make predictions; this loss to follow-up is often due to disenrollment from the health system. For subjects without complete follow-up, whether or not they experienced the adverse event is unknown, and in statistical terms the event time is said to be right-censored. Most machine learning approaches to the problem have been relatively ad hoc; for example, common approaches for handling observations in which the event status is unknown include (1) discarding those observations, (2) treating them as non-events, (3) splitting those observations into two observations: one where the event occurs and one where the event does not. In this paper, we present a general-purpose approach to account for right-censored outcomes using inverse probability of censoring weighting (IPCW). We illustrate how IPCW can easily be incorporated into a number of existing machine learning algorithms used to mine big health care data including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. We then show that our approach leads to better calibrated predictions than the three ad hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event, using EHD from a large U.S. Midwestern healthcare system.

KW - Censored data

KW - Electronic health data

KW - Inverse probability weighting

KW - Machine learning

KW - Risk prediction

KW - Survival analysis

UR - http://www.scopus.com/inward/record.url?scp=84962440559&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84962440559&partnerID=8YFLogxK

U2 - 10.1016/j.jbi.2016.03.009

DO - 10.1016/j.jbi.2016.03.009

M3 - Article

C2 - 26992568

AN - SCOPUS:84962440559

SN - 1532-0464

VL - 61

SP - 119

EP - 131

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

ER -
