Content-based methods for predicting web-site demographic attributes

Santosh Kabbur, Eui Hong Han, George Karypis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution



Demographic information plays an important role in gaining insight into a web-site's user-base and is used extensively to target online advertisements and promotions. This paper investigates machine-learning approaches for predicting the demographic attributes of web-sites using information derived from their content and their hyperlinked structure, without relying on any information obtained directly or indirectly from the web-site's users. Such methods are important because users are increasingly concerned about sharing their personal and behavioral information on the Internet. Regression-based approaches for predicting demographic attributes are developed and studied; they differ in the content-derived features they use, in how the prediction models are built, and in how web-page level predictions are aggregated to take into account the web's hyperlinked structure. In addition, a matrix-approximation based approach is developed for coupling the predictions of individual regression models into a model designed to predict the probability mass function of the attribute. Extensive experiments show that these methods are able to achieve an RMSE of 8-10% and provide insights on how to best train and apply such models.
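
The aggregation and evaluation steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every name in it is hypothetical. It assumes that page-level predictions are combined into a site-level value by a weighted mean with weight 1 + inlink count, and it replaces the paper's matrix-approximation coupling with a plain clip-and-renormalize step, used here only to show what a valid probability mass function over attribute buckets looks like. RMSE is computed in the usual way.

```python
import math


def aggregate_site_prediction(page_predictions, inlink_counts):
    """Combine page-level predictions into one site-level prediction.

    Assumption (not from the paper): each page is weighted by
    1 + its inlink count, so pages without inlinks still contribute.
    """
    if not page_predictions:
        raise ValueError("at least one page prediction is required")
    if len(page_predictions) != len(inlink_counts):
        raise ValueError("predictions and inlink counts must align")
    weights = [1 + count for count in inlink_counts]
    weighted_sum = sum(w * p for w, p in zip(weights, page_predictions))
    return weighted_sum / sum(weights)


def to_pmf(raw_scores):
    """Turn independent per-bucket scores into a probability mass function.

    Simple stand-in for the paper's matrix-approximation coupling:
    negative scores are clipped to zero and the rest are renormalized.
    """
    clipped = [max(0.0, score) for score in raw_scores]
    total = sum(clipped)
    if total == 0:
        # No usable signal: fall back to a uniform distribution.
        return [1.0 / len(clipped)] * len(clipped)
    return [score / total for score in clipped]


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical data: two pages of one site, the second with 3 inlinks.
site_score = aggregate_site_prediction([0.2, 0.6], [0, 3])
# Hypothetical per-bucket regression outputs for one attribute.
pmf = to_pmf([0.5, 0.7, -0.2])
error = rmse([0.5, 0.5], [0.4, 0.6])
```

In this sketch the page with more inlinks pulls the site-level value toward its own prediction, and the renormalized scores sum to one; whether either choice matches the models evaluated in the paper cannot be determined from the abstract alone.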

Original language: English (US)
Title of host publication: Proceedings - 10th IEEE International Conference on Data Mining, ICDM 2010
Number of pages: 6
State: Published - 2010
Event: 10th IEEE International Conference on Data Mining, ICDM 2010 - Sydney, NSW, Australia
Duration: Dec 14 2010 to Dec 17 2010

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786


Other: 10th IEEE International Conference on Data Mining, ICDM 2010
City: Sydney, NSW


Keywords

  • Content based models
  • Demographic attribute prediction
  • Inlink count
  • Probability mass function
  • Regression

