The effects of standardizing names for record linkage: Evidence from the United States and Norway

Rebecca Vick; Lap Huynh

doi:10.1080/01615440.2010.514849

Institute for Social Research & Data Innovation

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Standardizing name strings before searching for links is common practice during record linkage and is generally believed to increase the size and quality of linked data sets. In this article, the authors quantify the impact of name standardization on historical record linkage, using data from nineteenth-century censuses of the United States and Norway as test cases.

Original language	English (US)
Pages (from-to)	15-24
Number of pages	10
Journal	Historical Methods
Volume	44
Issue number	1
DOIs	https://doi.org/10.1080/01615440.2010.514849
State	Published - Jan 2011

Bibliographical note

Funding Information:
Fourth, and perhaps most important, the U.S. and Norwegian linkage projects used standard name dictionaries that were constructed with different levels of expertise, time, and effort. As previously reported, in the case of the United States, first-name strings that occur within any of the relevant data sets a minimum of 100 times were considered for standardization, and surname strings were never standardized. By contrast, collaborators in Norway (most notably, Gulbrand Alhaug and Bente Ramsvik of the University of Tromsø) created a standard-names dictionary for every unique first-name string encountered in all databases for the censuses of 1865, 1875, and 1900, and all unique surname strings encountered in the databases for 1865 and 1875. The standardization project was funded by a three-year grant from the Norwegian Research Council with the purpose, as described by Dr. Gulbrand Alhaug (pers. comm., October 31, 2009), of simplifying the variation problems for people (genealogists, local historians, etc.) looking for individuals in the digital censuses. Our Norwegian collaborators were aware of the project and considered the dictionaries to be appropriate for record linkage. For the Norwegian data, nearly 53,000 unique first-name strings existed after we cleaned the names of unwanted characters; 32,185 (61 percent) of these strings were amended for standardization. For surnames, more than 83,000 unique “cleaned” strings existed in the Norwegian data, 9,236 (11 percent) of which were amended. For purposes of comparison with the U.S. data, in table 2, we show the 12 most common male first-name strings for Norway and their Jaro-Winkler scores without standardization (see note 2).

Funding Information:
2. Some of the work of Norwegian name standardization was done systematically, and some was done manually. The project was funded by a three-year grant from the Norwegian Research Council. The purpose of the standardization project, as described by Alhaug (pers. comm., October 31, 2009), was not to produce standard name dictionaries for record linkage, but instead to simplify the variation problems for people (e.g., genealogists and “local” historians) who want to look for specific persons in the digital censuses. In any case, it is sensible to regard the use of these databases as appropriate for name standardization for record linkage.
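
The note above mentions three mechanics: cleaning name strings of unwanted characters, considering only strings that occur at least 100 times for standardization, and comparing strings by Jaro-Winkler score. The following is a minimal Python sketch of those ideas, not the authors' implementation. The function names, the tiny `STANDARD_NAMES` dictionary, and its entries are hypothetical and serve only as illustration; the prefix weight of 0.1 and the four-character prefix cap are the conventional Jaro-Winkler defaults, assumed here rather than taken from the article.

```python
from collections import Counter


def clean(name: str) -> str:
    """Lowercase a name string and drop every non-letter character."""
    return "".join(ch for ch in name.lower() if ch.isalpha())


def candidates(names, min_count=100):
    """Cleaned name strings that occur at least min_count times."""
    counts = Counter(clean(n) for n in names)
    return {n for n, c in counts.items() if c >= min_count}


def standardize(name: str, dictionary: dict) -> str:
    """Map a cleaned name to its standard form, if the dictionary has one."""
    cleaned = clean(name)
    return dictionary.get(cleaned, cleaned)


def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they lie within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


# Hypothetical dictionary entries, for illustration only.
STANDARD_NAMES = {"wm": "william", "willm": "william"}

raw_a, raw_b = "Wm.", "William"
before = jaro_winkler(clean(raw_a), clean(raw_b))
after = jaro_winkler(standardize(raw_a, STANDARD_NAMES),
                     standardize(raw_b, STANDARD_NAMES))
print(f"before standardization: {before:.3f}, after: {after:.3f}")
```

In this sketch the two strings compare as identical once both map to the same standard form, which is the effect a standard-names dictionary is meant to have on candidate links.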

Keywords

Norway
United States
census
historical demography
microdata
record linkage

Cite this

@article{0d2ce62ef1f64aaca00af30ab7a233b0,

title = "The effects of standardizing names for record linkage: Evidence from the United States and Norway",

abstract = "Standardizing name strings before searching for links is common practice during record linkage and is generally believed to increase the size and quality of linked data sets. In this article, the authors quantify the impact of name standardization on historical record linkage, using data from nineteenth-century censuses of the United States and Norway as test cases.",

keywords = "Norway, United States, census, historical demography, microdata, record linkage",

author = "Rebecca Vick and Lap Huynh",

note = "Funding Information: Fourth, and perhaps most important, the U.S. and Norwegian linkage projects used standard name dictionaries that were constructed with different levels of expertise, time, and effort. As previously reported, in the case of the United States, first-name strings that occur within any of the relevant data sets a minimum of 100 times were considered for standardization, and surname strings were never standardized. By contrast, collaborators in Norway (most notably, Gulbrand Alhaug and Bente Ramsvik of the University of Tromsø) created a standard-names dictionary for every unique first-name string encountered in all databases for the censuses of 1865, 1875, and 1900, and all unique surname strings encountered in the databases for 1865 and 1875. The standardization project was funded by a three-year grant from the Norwegian Research Council with the purpose, as described by Dr. Gulbrand Alhaug (pers. comm., October 31, 2009), of simplifying the variation problems for people (genealogists, local historians, etc.) looking for individuals in the digital censuses. Our Norwegian collaborators were aware of the project and considered the dictionaries to be appropriate for record linkage. For the Norwegian data, nearly 53,000 unique first-name strings existed after we cleaned the names of unwanted characters; 32,185 (61 percent) of these strings were amended for standardization. For surnames, more than 83,000 unique “cleaned” strings existed in the Norwegian data, 9,236 (11 percent) of which were amended. For purposes of comparison with the U.S. data, in table 2, we show the 12 most common male first-name strings for Norway and their Jaro-Winkler scores without standardization (see note 2). Funding Information: 2. Some of the work of Norwegian name standardization was done systematically, and some was done manually. The project was funded by a three-year grant from the Norwegian Research Council. The purpose of the standardization project, as described by Alhaug (pers. comm., October 31, 2009), was not to produce standard name dictionaries for record linkage, but instead to simplify the variation problems for people (e.g., genealogists and “local” historians) who want to look for specific persons in the digital censuses. In any case, it is sensible to regard the use of these databases as appropriate for name standardization for record linkage.",

year = "2011",

month = jan,

doi = "10.1080/01615440.2010.514849",

language = "English (US)",

volume = "44",

pages = "15--24",

journal = "Historical Methods",

issn = "0161-5440",

publisher = "Routledge",

number = "1",

}

TY - JOUR

T1 - The effects of standardizing names for record linkage

T2 - Evidence from the United States and Norway

AU - Vick, Rebecca

AU - Huynh, Lap

N1 - Funding Information: Fourth, and perhaps most important, the U.S. and Norwegian linkage projects used standard name dictionaries that were constructed with different levels of expertise, time, and effort. As previously reported, in the case of the United States, first-name strings that occur within any of the relevant data sets a minimum of 100 times were considered for standardization, and surname strings were never standardized. By contrast, collaborators in Norway (most notably, Gulbrand Alhaug and Bente Ramsvik of the University of Tromsø) created a standard-names dictionary for every unique first-name string encountered in all databases for the censuses of 1865, 1875, and 1900, and all unique surname strings encountered in the databases for 1865 and 1875. The standardization project was funded by a three-year grant from the Norwegian Research Council with the purpose, as described by Dr. Gulbrand Alhaug (pers. comm., October 31, 2009), of simplifying the variation problems for people (genealogists, local historians, etc.) looking for individuals in the digital censuses. Our Norwegian collaborators were aware of the project and considered the dictionaries to be appropriate for record linkage. For the Norwegian data, nearly 53,000 unique first-name strings existed after we cleaned the names of unwanted characters; 32,185 (61 percent) of these strings were amended for standardization. For surnames, more than 83,000 unique “cleaned” strings existed in the Norwegian data, 9,236 (11 percent) of which were amended. For purposes of comparison with the U.S. data, in table 2, we show the 12 most common male first-name strings for Norway and their Jaro-Winkler scores without standardization (see note 2). Funding Information: 2. Some of the work of Norwegian name standardization was done systematically, and some was done manually. The project was funded by a three-year grant from the Norwegian Research Council. The purpose of the standardization project, as described by Alhaug (pers. comm., October 31, 2009), was not to produce standard name dictionaries for record linkage, but instead to simplify the variation problems for people (e.g., genealogists and “local” historians) who want to look for specific persons in the digital censuses. In any case, it is sensible to regard the use of these databases as appropriate for name standardization for record linkage.

PY - 2011/1

Y1 - 2011/1

N2 - Standardizing name strings before searching for links is common practice during record linkage and is generally believed to increase the size and quality of linked data sets. In this article, the authors quantify the impact of name standardization on historical record linkage, using data from nineteenth-century censuses of the United States and Norway as test cases.

AB - Standardizing name strings before searching for links is common practice during record linkage and is generally believed to increase the size and quality of linked data sets. In this article, the authors quantify the impact of name standardization on historical record linkage, using data from nineteenth-century censuses of the United States and Norway as test cases.

KW - Norway

KW - United States

KW - census

KW - historical demography

KW - microdata

KW - record linkage

UR - http://www.scopus.com/inward/record.url?scp=79952012366&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=79952012366&partnerID=8YFLogxK

U2 - 10.1080/01615440.2010.514849

DO - 10.1080/01615440.2010.514849

M3 - Article

AN - SCOPUS:79952012366

SN - 0161-5440

VL - 44

SP - 15

EP - 24

JO - Historical Methods

JF - Historical Methods

IS - 1

ER -
