RSQRT: An heuristic for estimating the number of clusters to report

John Carlis; Kelsey Bruso

doi:10.1016/j.elerap.2011.12.006

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Clustering can be a valuable tool for analyzing large data sets, such as in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter. Elsewhere we introduced a strongly-supported heuristic, RSQRT, which predicts K as a function of the attribute or item count, depending on attribute scales. We conducted a second analysis where we sought confirmation of the heuristic, analyzing data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, RSQRT best-predicted K and the Bayesian information criterion (BIC) predicted K are the same. RSQRT has a lower cost of O(log log n) versus O(n²) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.

Original language	English (US)
Pages (from-to)	152-158
Number of pages	7
Journal	Electronic Commerce Research and Applications
Volume	11
Issue number	2
DOIs	https://doi.org/10.1016/j.elerap.2011.12.006
State	Published - Mar 2012

Bibliographical note

Funding Information:
This work was supported in part by the US NIH Grant 1R01DE017734 .

Keywords

Bayesian information criterion
Clustering
Data analytics
E-commerce
Heuristic
Spiral visualization

Cite this

@article{b79b630299be4b85ae329911618ac44d,

title = "RSQRT: An heuristic for estimating the number of clusters to report",

abstract = "Clustering can be a valuable tool for analyzing large data sets, such as in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter. Elsewhere we introduced a strongly-supported heuristic, RSQRT, which predicts K as a function of the attribute or item count, depending on attribute scales. We conducted a second analysis where we sought confirmation of the heuristic, analyzing data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, RSQRT best-predicted K and the Bayesian information criterion (BIC) predicted K are the same. RSQRT has a lower cost of O(log log n) versus O(n$^2$) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.",

keywords = "Bayesian information criterion, Clustering, Data analytics, E-commerce, Heuristic, Spiral visualization",

author = "John Carlis and Kelsey Bruso",

note = "Funding Information: This work was supported in part by the US NIH Grant 1R01DE017734.",

year = "2012",

month = mar,

doi = "10.1016/j.elerap.2011.12.006",

language = "English (US)",

volume = "11",

pages = "152--158",

journal = "Electronic Commerce Research and Applications",

issn = "1567-4223",

publisher = "Elsevier",

number = "2",

}

TY - JOUR

T1 - RSQRT

T2 - An heuristic for estimating the number of clusters to report

AU - Carlis, John

AU - Bruso, Kelsey

N1 - Funding Information: This work was supported in part by the US NIH Grant 1R01DE017734.

PY - 2012/3

Y1 - 2012/3

N2 - Clustering can be a valuable tool for analyzing large data sets, such as in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter. Elsewhere we introduced a strongly-supported heuristic, RSQRT, which predicts K as a function of the attribute or item count, depending on attribute scales. We conducted a second analysis where we sought confirmation of the heuristic, analyzing data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, RSQRT best-predicted K and the Bayesian information criterion (BIC) predicted K are the same. RSQRT has a lower cost of O(log log n) versus O(n²) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.

AB - Clustering can be a valuable tool for analyzing large data sets, such as in e-commerce applications. Anyone who clusters must choose how many item clusters, K, to report. Unfortunately, one must guess at K or some related parameter. Elsewhere we introduced a strongly-supported heuristic, RSQRT, which predicts K as a function of the attribute or item count, depending on attribute scales. We conducted a second analysis where we sought confirmation of the heuristic, analyzing data sets from the UCI machine learning benchmark repository. For the 25 studies where sufficient detail was available, we again found strong support. Also, in a side-by-side comparison of 28 studies, RSQRT best-predicted K and the Bayesian information criterion (BIC) predicted K are the same. RSQRT has a lower cost of O(log log n) versus O(n²) for BIC, and is more widely applicable. Using RSQRT prospectively could be much better than merely guessing.

KW - Bayesian information criterion

KW - Clustering

KW - Data analytics

KW - E-commerce

KW - Heuristic

KW - Spiral visualization

UR - http://www.scopus.com/inward/record.url?scp=84859766294&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84859766294&partnerID=8YFLogxK

U2 - 10.1016/j.elerap.2011.12.006

DO - 10.1016/j.elerap.2011.12.006

M3 - Article

C2 - 22773923

AN - SCOPUS:84859766294

SN - 1567-4223

VL - 11

SP - 152

EP - 158

JO - Electronic Commerce Research and Applications

JF - Electronic Commerce Research and Applications

IS - 2

ER -