The effects of standardizing names for record linkage: Evidence from the United States and Norway

Rebecca Vick, Lap Huynh

Research output: Contribution to journal › Article › peer-review

11 Scopus citations

Abstract

Standardizing name strings before searching for links is common practice during record linkage and is generally believed to increase the size and quality of linked data sets. In this article, the authors quantify the impact of name standardization on historical record linkage, using data from nineteenth-century censuses of the United States and Norway as test cases.

Original language: English (US)
Pages (from-to): 15-24
Number of pages: 10
Journal: Historical Methods
Volume: 44
Issue number: 1
State: Published - Jan 2011

Bibliographical note

Funding Information:
Fourth, and perhaps most important, the U.S. and Norwegian linkage projects used standard name dictionaries that were constructed with different levels of expertise, time, and effort. As previously reported, in the case of the United States, first-name strings that occur at least 100 times within any of the relevant data sets were considered for standardization, and surname strings were never standardized. By contrast, collaborators in Norway (most notably, Gulbrand Alhaug and Bente Ramsvik of the University of Tromsø) created a standard-names dictionary for every unique first-name string encountered in all databases for the censuses of 1865, 1875, and 1900, and all unique surname strings encountered in the databases for 1865 and 1875. The standardization project was funded by a three-year grant from the Norwegian Research Council with the purpose, as described by Dr. Gulbrand Alhaug (pers. comm., October 31, 2009), of simplifying the variation problems for people (genealogists, local historians, etc.) looking for individuals in the digital censuses. Our Norwegian collaborators were aware of the project and considered the dictionaries to be appropriate for record linkage. For the Norwegian data, nearly 53,000 unique first-name strings existed after we cleaned the names of unwanted characters; 32,185 (61 percent) of these strings were amended for standardization. For surnames, more than 83,000 unique “cleaned” strings existed in the Norwegian data, 9,236 (11 percent) of which were amended. For purposes of comparison with the U.S. data, in table 2, we show the 12 most common male first-name strings for Norway and their Jaro-Winkler scores without standardization (see note 2).

Funding Information:
2. Some of the work of Norwegian name standardization was done systematically, and some was done manually. The project was supported by a three-year grant from the Norwegian Research Council. The purpose of the standardization project, as described by Alhaug (pers. comm., October 31, 2009), was not to produce standard name dictionaries for record linkage, but instead to simplify the variation problems for people (e.g., genealogists and “local” historians) who want to look for specific persons in the digital censuses. In any case, it is sensible to regard the use of these databases as appropriate for name standardization for record linkage.
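
The note above refers to cleaning name strings, looking them up in a standard-names dictionary, restricting standardization to strings that occur at least 100 times, and comparing names by Jaro-Winkler score. The Python sketch below illustrates how those pieces might fit together. It is not the authors' implementation: the dictionary entries, the function names (`clean`, `standardize`, `frequent_strings`, `compare`), and the Jaro-Winkler parameters (prefix weight 0.1, prefix length capped at 4) are assumptions chosen for illustration.

```python
from collections import Counter


def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix (assumed parameters)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# Hypothetical standard-names dictionary; entries are illustrative only
# and are not taken from the U.S. or Norwegian dictionaries.
STANDARD_NAMES = {"WM": "WILLIAM", "WILL": "WILLIAM", "JNO": "JOHN"}


def clean(name):
    """Uppercase the string and drop non-letter characters."""
    return "".join(ch for ch in name.upper() if ch.isalpha())


def standardize(name, dictionary=STANDARD_NAMES):
    """Return the dictionary's standard form, or the cleaned string."""
    cleaned = clean(name)
    return dictionary.get(cleaned, cleaned)


def frequent_strings(names, min_count=100):
    """Cleaned strings occurring at least min_count times."""
    counts = Counter(clean(n) for n in names)
    return {n for n, c in counts.items() if c >= min_count}


def compare(a, b):
    """Jaro-Winkler score without and with standardization."""
    raw = jaro_winkler(clean(a), clean(b))
    std = jaro_winkler(standardize(a), standardize(b))
    return raw, std
```

Under these assumptions, `compare("Wm.", "William")` returns a low score for the cleaned strings and a score of 1.0 once both are mapped to the same standard form, which is the kind of difference the article sets out to quantify.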

Keywords

  • Norway
  • United States
  • census
  • historical demography
  • microdata
  • record linkage
