High dimensional, robust, unsupervised record linkage

Sabyasachi Bera; Snigdhansu Chatterjee

doi:10.21307/STATTRANS-2020-034

Statistics (Twin Cities)

Research output: Contribution to journal › Article › peer-review

Abstract

We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented.

Original language	English (US)
Pages (from-to)	123-143
Number of pages	21
Journal	Statistics in Transition New Series
Volume	21
Issue number	4
DOIs	https://doi.org/10.21307/STATTRANS-2020-034
State	Published - Sep 14 2020

Bibliographical note

Publisher Copyright:
© 2020 Główny Urząd Statystyczny. All rights reserved.

Keywords

High dimensional
Principal components
Record linkage
Robust

Cite this

@article{bd12cb839a5e46298873827c52ae6a47,
  title = "High dimensional, robust, unsupervised record linkage",
  abstract = "We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented.",
  keywords = "High dimensional, Principal components, Record linkage, Robust",
  author = "Sabyasachi Bera and Snigdhansu Chatterjee",
  year = "2020",
  month = sep,
  day = "14",
  doi = "10.21307/STATTRANS-2020-034",
  language = "English (US)",
  volume = "21",
  pages = "123--143",
  journal = "Statistics in Transition New Series",
  issn = "1234-7655",
  publisher = "Central Statistical Office of Poland",
  number = "4",
}

TY  - JOUR
T1  - High dimensional, robust, unsupervised record linkage
AU  - Bera, Sabyasachi
AU  - Chatterjee, Snigdhansu
PY  - 2020/9/14
Y1  - 2020/9/14
N2  - We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented.
AB  - We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented.
KW  - High dimensional
KW  - Principal components
KW  - Record linkage
KW  - Robust
UR  - http://www.scopus.com/inward/record.url?scp=85092140684&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85092140684&partnerID=8YFLogxK
U2  - 10.21307/STATTRANS-2020-034
DO  - 10.21307/STATTRANS-2020-034
M3  - Article
AN  - SCOPUS:85092140684
SN  - 1234-7655
VL  - 21
SP  - 123
EP  - 143
JO  - Statistics in Transition New Series
JF  - Statistics in Transition New Series
IS  - 4
ER  -
