RSQRT: An heuristic for estimating the number of clusters to report

John Carlis, Kelsey Bruso

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Clustering can be a valuable tool for analyzing large data sets, such as in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter. Elsewhere we introduced a strongly-supported heuristic, RSQRT, which predicts K as a function of the attribute or item count, depending on attribute scales. We conducted a second analysis where we sought confirmation of the heuristic, analyzing data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, RSQRT best-predicted K and the Bayesian information criterion (BIC) predicted K are the same. RSQRT has a lower cost of O(log log n) versus O(n^2) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.

Original language: English (US)
Pages (from-to): 152-158
Number of pages: 7
Journal: Electronic Commerce Research and Applications
Volume: 11
Issue number: 2
State: Published - Mar 2012

Bibliographical note

Funding Information:
This work was supported in part by the US NIH Grant 1R01DE017734.

Keywords

  • Bayesian information criterion
  • Clustering
  • Data analytics
  • E-commerce
  • Heuristic
  • Spiral visualization
