Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation

Xu Shi; Xiaoou Li; Tianxi Cai

doi:10.1080/01621459.2020.1752219

Statistics (Twin Cities)

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatch in the response-predictor pairs. We propose a three-step algorithm in which we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix (Formula presented.) ignoring the mismatch. We then estimate a mapping matrix aiming to correct the mismatch using hard-thresholding to induce sparsity, while incorporating potential group information. Finally, we obtain a refined estimate for (Formula presented.) by removing the estimated mismatched pairs. We derive the error bound for the initial estimate of (Formula presented.) in both fixed and high-dimensional settings. We demonstrate that the refined estimate of (Formula presented.) achieves an error rate that is as good as if no mismatch were present. We show that our mapping recovery method not only correctly distinguishes one-to-one and one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence. We examine the finite sample performance of the proposed method via extensive simulation studies and an application to the unsupervised translation of medical codes using electronic health records data. Supplementary materials for this article are available online.
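
The three steps named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the mapping-matrix estimation with group information is replaced here by a much simpler rule that flags a pair as mismatched when the cosine between the rotated predictor and its response falls below a hard threshold. The function names, the threshold value, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def procrustes_rotation(X, Y):
    """Orthogonal W minimizing ||Y - X W||_F, from the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def three_step_sketch(X, Y, cos_threshold=0.9):
    """Simplified illustration; not the paper's mapping-matrix estimator."""
    # Step 1: initial estimate that ignores possible mismatch.
    W0 = procrustes_rotation(X, Y)
    # Step 2 (simplified): rows are unit length, so the row-wise inner
    # product is a cosine; keep only pairs at or above a hard threshold.
    cosines = np.sum((X @ W0) * Y, axis=1)
    keep = cosines >= cos_threshold
    # Step 3: refit on the pairs that were not flagged.
    W1 = procrustes_rotation(X[keep], Y[keep])
    return W0, keep, W1


def simulate(n=200, p=5, n_mismatch=10, noise=0.02, seed=0):
    """Synthetic unit vectors; the first n_mismatch responses are shuffled."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    W_true, _ = np.linalg.qr(rng.normal(size=(p, p)))  # random orthogonal matrix
    Y = X @ W_true + noise * rng.normal(size=(n, p))
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    # Cyclic shift: each of the first n_mismatch rows gets another row's response.
    Y[:n_mismatch] = np.roll(Y[:n_mismatch], 1, axis=0)
    return X, Y, W_true


if __name__ == "__main__":
    X, Y, W_true = simulate()
    W0, keep, W1 = three_step_sketch(X, Y)
    print("initial error:", np.linalg.norm(W0 - W_true))
    print("refined error:", np.linalg.norm(W1 - W_true))
    print("pairs flagged as mismatched:", int((~keep).sum()))
```

The closed-form SVD solution is the standard answer to the orthogonal Procrustes problem; the thresholding rule is only a stand-in, and it cannot recover one-to-many correspondences or weight vectors the way the method described in the abstract does.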

Original language	English (US)
Pages (from-to)	1953-1964
Number of pages	12
Journal	Journal of the American Statistical Association
Volume	116
Issue number	536
DOIs	https://doi.org/10.1080/01621459.2020.1752219
State	Published - 2021

Bibliographical note

Funding Information:
Research reported in this publication was partially supported by the National Science Foundation (award DMS-1712657, to Xiaoou Li). We thank the associate editor and two referees for their helpful comments.

Publisher Copyright:
© 2020 American Statistical Association.

Keywords

Electronic health records
Hard-thresholding
Mismatched data
Ontology translation
Spherical regression

Cite this

@article{4de051a09380481e99394c5bde6726d3,
  title = "Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation",
  abstract = "Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatch in the response-predictor pairs. We propose a three-step algorithm in which we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix (Formula presented.) ignoring the mismatch. We then estimate a mapping matrix aiming to correct the mismatch using hard-thresholding to induce sparsity, while incorporating potential group information. We eventually obtain a refined estimate for (Formula presented.) by removing the estimated mismatched pairs. We derive the error bound for the initial estimate of (Formula presented.) in both fixed and high-dimensional setting. We demonstrate that the refined estimate of (Formula presented.) achieves an error rate that is as good as if no mismatch is present. We show that our mapping recovery method not only correctly distinguishes one-to-one and one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence. We examine the finite sample performance of the proposed method via extensive simulation studies, and with application to the unsupervised translation of medical codes using electronic health records data. Supplementary materials for this article are available online.",
  keywords = "Electronic health records, Hard-thresholding, Mismatched data, Ontology translation, Spherical regression",
  author = "Xu Shi and Xiaoou Li and Tianxi Cai",
  note = "Funding Information: Research reported in this publication was partially supported by the National Science Foundation (award DMS-1712657, to Xiaoou Li). We thank the associate editor and two referees for their helpful comments. Publisher Copyright: {\textcopyright} 2020 American Statistical Association.",
  year = "2021",
  doi = "10.1080/01621459.2020.1752219",
  language = "English (US)",
  volume = "116",
  pages = "1953--1964",
  journal = "Journal of the American Statistical Association",
  issn = "0162-1459",
  publisher = "Taylor and Francis Ltd.",
  number = "536",
}

TY  - JOUR
T1  - Spherical Regression Under Mismatch Corruption With Application to Automated Knowledge Translation
AU  - Shi, Xu
AU  - Li, Xiaoou
AU  - Cai, Tianxi
N1  - Funding Information: Research reported in this publication was partially supported by the National Science Foundation (award DMS-1712657, to Xiaoou Li). We thank the associate editor and two referees for their helpful comments. Publisher Copyright: © 2020 American Statistical Association.
PY  - 2021
Y1  - 2021
AB  - Motivated by a series of applications in data integration, language translation, bioinformatics, and computer vision, we consider spherical regression with two sets of unit-length vectors when the data are corrupted by a small fraction of mismatch in the response-predictor pairs. We propose a three-step algorithm in which we initialize the parameters by solving an orthogonal Procrustes problem to estimate a translation matrix (Formula presented.) ignoring the mismatch. We then estimate a mapping matrix aiming to correct the mismatch using hard-thresholding to induce sparsity, while incorporating potential group information. We eventually obtain a refined estimate for (Formula presented.) by removing the estimated mismatched pairs. We derive the error bound for the initial estimate of (Formula presented.) in both fixed and high-dimensional setting. We demonstrate that the refined estimate of (Formula presented.) achieves an error rate that is as good as if no mismatch is present. We show that our mapping recovery method not only correctly distinguishes one-to-one and one-to-many correspondences, but also consistently identifies the matched pairs and estimates the weight vector for combined correspondence. We examine the finite sample performance of the proposed method via extensive simulation studies, and with application to the unsupervised translation of medical codes using electronic health records data. Supplementary materials for this article are available online.
KW  - Electronic health records
KW  - Hard-thresholding
KW  - Mismatched data
KW  - Ontology translation
KW  - Spherical regression
UR  - http://www.scopus.com/inward/record.url?scp=85084842104&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85084842104&partnerID=8YFLogxK
U2  - 10.1080/01621459.2020.1752219
DO  - 10.1080/01621459.2020.1752219
M3  - Article
AN  - SCOPUS:85084842104
SN  - 0162-1459
VL  - 116
SP  - 1953
EP  - 1964
JO  - Journal of the American Statistical Association
JF  - Journal of the American Statistical Association
IS  - 536
ER  -
