High dimensional, robust, unsupervised record linkage

Sabyasachi Bera, Snigdhansu Chatterjee

Research output: Contribution to journal › Article › peer-review


We develop a technique for record linkage on high dimensional data, where the two datasets may not have any common variable, and there may be no training set available. Our methodology is based on sparse, high dimensional principal components. Since large and high dimensional datasets are often prone to outliers and aberrant observations, we propose a technique for estimating robust, high dimensional principal components. We present theoretical results validating the robust, high dimensional principal component estimation steps, and justifying their use for record linkage. Some numeric results and remarks are also presented.

Original language: English (US)
Pages (from-to): 123-143
Number of pages: 21
Journal: Statistics in Transition
Issue number: 4
State: Published - Sep 14 2020
Externally published: Yes

Bibliographical note

Funding Information:
This research is partially supported by the US National Science Foundation (NSF) under grants DMS-1622483, DMS-1737918, OAC-1939916, and DMR-1939956.


  • High dimensional
  • Principal components
  • Record linkage
  • Robust
