Generative model-based clustering of directional data

Arindam Banerjee, Inderjit Dhillon, Joydeep Ghosh, Suvrit Sra

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

80 Scopus citations

Abstract

High-dimensional directional data is becoming increasingly important in contemporary applications such as the analysis of text and gene-expression data. A natural model for multivariate directional data is provided by the von Mises-Fisher (vMF) distribution on the unit hypersphere, which is analogous to the multivariate Gaussian distribution in ℝ^d. In this paper, we propose modeling complex directional data as a mixture of vMF distributions. We derive and analyze two variants of the Expectation Maximization (EM) framework for estimating the parameters of this mixture. We also propose two clustering algorithms corresponding to these variants. An interesting aspect of our methodology is that the spherical k-means algorithm (k-means with cosine similarity) can be shown to be a special case of both our algorithms. Thus, modeling text data by vMF distributions lends theoretical validity to the use of cosine similarity, which has been widely used by the information retrieval community. As part of experimental validation, we present results on modeling high-dimensional text and gene-expression data as a mixture of vMF distributions. The results indicate that our approach yields superior clusterings, especially for difficult clustering tasks in high-dimensional spaces.
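The spherical k-means algorithm that the abstract names as a special case can be illustrated with a minimal sketch. This is not the authors' code and it does not implement the vMF mixture EM variants themselves; it only shows the hard-assignment procedure usually meant by "k-means with cosine similarity". The function names (`normalize`, `dot`, `spherical_kmeans`), the random initialization, and the fixed iteration count are assumptions made for illustration.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    """Inner product; for unit vectors this equals the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(points, k, n_iter=20, seed=0):
    """Illustrative spherical k-means (hypothetical helper, not from the paper).

    points: list of equal-length numeric vectors
    k:      number of clusters
    Returns (labels, centroids), where each centroid is a unit vector.
    """
    rng = random.Random(seed)
    # Project every point onto the unit hypersphere.
    data = [normalize(p) for p in points]
    # Assumed initialization: k distinct data points chosen at random.
    centroids = [list(c) for c in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assignment step: pick the centroid with the largest cosine similarity.
        labels = [max(range(k), key=lambda j: dot(x, centroids[j])) for x in data]
        # Update step: sum the members of each cluster and renormalize,
        # so the new centroid is the cluster's mean direction.
        for j in range(k):
            members = [x for x, lab in zip(data, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                summed = [sum(col) for col in zip(*members)]
                centroids[j] = normalize(summed)
    return labels, centroids
```

Calling `spherical_kmeans(vectors, k=2)` on a list of raw vectors returns one cluster label per vector together with unit-length centroids. Because every vector is normalized first, only its direction, not its magnitude, affects the result, which is the sense in which the data are treated as directional.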

Original language: English (US)
Title of host publication: Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03
Pages: 19-28
Number of pages: 10
State: Published - Dec 1 2003
Event: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03 - Washington, DC, United States
Duration: Aug 24 2003 to Aug 27 2003

Other

Other: 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '03
Country/Territory: United States
City: Washington, DC
Period: 8/24/03 to 8/27/03

Keywords

  • Clustering
  • Directional data
  • EM
  • Mixtures
  • Von Mises-Fisher
