A primer on theory-driven web scraping: Automatic extraction of big data from the internet for use in psychological research

Richard N. Landers, Robert C. Brusso, Katelyn J. Cavanaugh, Andrew B. Collmus

Research output: Contribution to journal › Article › peer-review

124 Scopus citations

Abstract

The term big data encompasses a wide range of approaches to collecting and analyzing data in ways that were not possible before the era of modern personal computing. One approach to big data of great potential to psychologists is web scraping, which involves the automated collection of information from webpages. Although web scraping can create massive datasets with tens of thousands of variables, it can also be used to create, in a matter of hours, modestly sized, more manageable datasets with tens of variables but hundreds of thousands of cases, well within the skillset of most psychologists to analyze. In this article, we demystify web scraping methods as currently used to examine research questions of interest to psychologists. First, we introduce an approach called theory-driven web scraping in which the choice to use web-based big data must follow substantive theory. Second, we introduce data source theories, a term used to describe the assumptions a researcher must make about a prospective big data source in order to meaningfully scrape data from it. Critically, researchers must derive specific hypotheses to be tested based upon their data source theory, and if these hypotheses are not empirically supported, plans to use that data source should be changed or eliminated. Third, we provide a case study and sample code in Python demonstrating how web scraping can be conducted to collect big data along with links to a web tutorial designed for psychologists. Fourth, we describe a 4-step process to be followed in web scraping projects. Fifth and finally, we discuss legal, practical, and ethical concerns faced when conducting web scraping projects.
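As a rough illustration of what "automated collection of information from webpages" can look like in Python, the following minimal sketch turns a page into a small dataset with one case per post. It is not the article's sample code. The HTML snippet, the class names `post`, `author`, and `body`, and the helper names `PostParser` and `scrape_posts` are all hypothetical, and the page is embedded as a string rather than downloaded so that the example runs without network access.

```python
from html.parser import HTMLParser

# Hypothetical page content (an assumption for illustration). A real project
# would download pages only after checking the site's terms of use.
SAMPLE_HTML = """
<html><body>
  <div class="post"><span class="author">user_a</span><p class="body">First message.</p></div>
  <div class="post"><span class="author">user_b</span><p class="body">Second, longer message here.</p></div>
</body></html>
"""


class PostParser(HTMLParser):
    """Collect one record (case) per <div class="post"> element."""

    def __init__(self):
        super().__init__()
        self.records = []      # finished cases
        self._current = None   # record currently being built
        self._field = None     # variable currently being read

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "post":
            self._current = {"author": "", "body": ""}
        elif self._current is not None and css_class in ("author", "body"):
            self._field = css_class

    def handle_data(self, data):
        # Only keep text that sits inside a known field of a post.
        if self._current is not None and self._field is not None:
            self._current[self._field] += data

    def handle_endtag(self, tag):
        if tag in ("span", "p"):
            self._field = None
        elif tag == "div" and self._current is not None:
            self.records.append(self._current)
            self._current = None


def scrape_posts(html):
    """Return a list of dicts, one per post, with a derived word count."""
    parser = PostParser()
    parser.feed(html)
    parser.close()
    for record in parser.records:
        record["word_count"] = len(record["body"].split())
    return parser.records


posts = scrape_posts(SAMPLE_HTML)
```

Each dictionary in `posts` is one case with three variables (`author`, `body`, `word_count`), which mirrors the "tens of variables, many cases" shape described above. Which elements to extract, and what they are assumed to measure, would follow from the researcher's data source theory rather than from the page layout alone.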

Original language: English (US)
Pages (from-to): 475-492
Number of pages: 18
Journal: Psychological Methods
Volume: 21
Issue number: 4
State: Published - Dec 1 2016
Externally published: Yes

Keywords

  • Big data
  • Data source theory
  • Python
  • Tutorial
  • Web scraping
