Standardizing name strings before searching for links is common practice during record linkage and is generally believed to increase the size and quality of linked data sets. In this article, the authors quantify the impact of name standardization on historical record linkage, using data from nineteenth-century censuses of the United States and Norway as test cases.
Bibliographical note
Funding Information:
Fourth, and perhaps most important, the U.S. and Norwegian linkage projects used standard name dictionaries that were constructed with different levels of expertise, time, and effort. As previously reported, in the case of the United States, first-name strings that occur within any of the relevant data sets a minimum of 100 times were considered for standardization, and surname strings were never standardized. By contrast, collaborators in Norway (most notably, Gulbrand Alhaug and Bente Ramsvik of the University of Tromsø) created a standard-names dictionary for every unique first-name string encountered in all databases for the censuses of 1865, 1875, and 1900, and all unique surname strings encountered in the databases for 1865 and 1875. The standardization project was funded by a three-year grant from the Norwegian Research Council with the purpose, as described by Dr. Gulbrand Alhaug (pers. comm., October 31, 2009), of simplifying the variation problems for people (genealogists, local historians, etc.) looking for individuals in the digital censuses. Our Norwegian collaborators were aware of the project and considered the dictionaries to be appropriate for record linkage. For the Norwegian data, nearly 53,000 unique first-name strings existed after we cleaned the names of unwanted characters; 32,185 (61 percent) of these strings were amended for standardization. For surnames, more than 83,000 unique “cleaned” strings existed in the Norwegian data, 9,236 (11 percent) of which were amended. For purposes of comparison with the U.S. data, in table 2, we show the 12 most common male first-name strings for Norway and their Jaro-Winkler scores without standardization.²
2. Some of the work of Norwegian name standardization was done systematically, and some was done manually. The project was supported by a three-year grant from the Norwegian Research Council. The purpose of the standardization project, as described by Alhaug (pers. comm., October 31, 2009), was not to produce standard name dictionaries for record linkage, but instead to simplify the variation problems for people (e.g., genealogists and “local” historians) who want to look for specific persons in the digital censuses. In any case, it is sensible to regard the use of these databases as appropriate for name standardization for record linkage.
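The steps described above (cleaning name strings, looking them up in a standard-names dictionary, and comparing strings with Jaro-Winkler scores) can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the sample dictionary entries, and the way the 100-occurrence rule is applied at lookup time are all assumptions made for illustration, and the Jaro-Winkler parameters (prefix weight 0.1, prefix length capped at 4) are the conventional defaults rather than values stated in the text.

```python
from collections import Counter


def clean_name(raw):
    """Uppercase a name and drop everything except letters and spaces."""
    kept = "".join(ch for ch in raw.upper() if ch.isalpha() or ch == " ")
    return " ".join(kept.split())  # collapse repeated whitespace


def standardize(raw, dictionary, counts=None, min_count=0):
    """Map a cleaned name to its standard form, if the dictionary has one.

    Assumption: when `counts` is given, strings seen fewer than `min_count`
    times are left as-is, loosely mirroring the U.S. rule that only strings
    occurring at least 100 times were considered for standardization.
    """
    cleaned = clean_name(raw)
    if counts is not None and counts.get(cleaned, 0) < min_count:
        return cleaned
    return dictionary.get(cleaned, cleaned)


def jaro(s1, s2):
    """Jaro similarity between two strings, in the range [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to `max_prefix` chars."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


# Hypothetical dictionary entries and frequencies, for illustration only.
first_name_dictionary = {"WM": "WILLIAM", "WILLM": "WILLIAM"}
first_name_counts = Counter({"WM": 150, "WILLM": 3})

common = standardize("Wm.", first_name_dictionary, first_name_counts, min_count=100)
rare = standardize("Willm", first_name_dictionary, first_name_counts, min_count=100)
score_raw = jaro_winkler(clean_name("Wm."), "WILLIAM")
score_std = jaro_winkler(common, "WILLIAM")
```

In this sketch the frequent string "WM" is mapped to "WILLIAM" and then matches exactly, whereas the rare string "WILLM" falls below the threshold and keeps its cleaned form, so it must rely on the Jaro-Winkler score alone.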
- United States
- historical demography
- record linkage