A scalable algorithm for clustering sequential data

Valerie Guralnik; George Karypis

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

81 Scopus citations

Abstract

In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and web-logs have an inherent sequential nature. Clustering of such data sets is useful for various purposes. For example, clustering of sequences from commercial data sets may help marketers identify different customer groups based upon their purchasing patterns. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Over the years, many methods have been developed for clustering objects according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experiments using data sets derived from sequences of purchasing transactions and protein sequences show that this approach is scalable and leads to reasonably good clusters.
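The abstract does not say how sequences are represented before K-means is applied. As a minimal sketch, assuming each sequence is mapped to a fixed-length vector of k-gram counts (an illustrative choice, not necessarily the authors' feature set), plain K-means can cluster the vectors without comparing every sequence against every other one. All function names below (`build_vocab`, `kgram_vector`, `kmeans`, `cluster_sequences`) are hypothetical.

```python
import random
from collections import Counter


def build_vocab(seqs, k=2):
    """Collect every k-gram seen in the data (assumed feature set)."""
    grams = set()
    for s in seqs:
        for i in range(len(s) - k + 1):
            grams.add(tuple(s[i:i + k]))
    return sorted(grams)


def kgram_vector(seq, vocab, k=2):
    """Map one sequence to a vector of k-gram counts over the vocabulary."""
    counts = Counter(tuple(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return [float(counts[g]) for g in vocab]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Lloyd-style K-means; each pass compares points only to centroids."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every vector.
        labels = [
            min(range(n_clusters), key=lambda c: sq_dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: recompute each non-empty cluster's mean.
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_sequences(seqs, n_clusters, k=2):
    """Vectorize the sequences, then cluster the vectors with K-means."""
    vocab = build_vocab(seqs, k)
    vectors = [kgram_vector(s, vocab, k) for s in seqs]
    return kmeans(vectors, n_clusters)
```

For example, `cluster_sequences(["ababab", "ababa", "cdcdcd", "dcdcd"], 2)` returns one cluster label per input sequence. In this sketch each pass compares every vector only to the cluster centroids, so the work per pass grows linearly with the number of sequences rather than quadratically.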

Original language	English (US)
Title of host publication	Proceedings - 2001 IEEE International Conference on Data Mining, ICDM'01
Pages	179-186
Number of pages	8
State	Published - 2001
Event	1st IEEE International Conference on Data Mining, ICDM'01 - San Jose, CA, United States; Duration: Nov 29 2001 → Dec 2 2001

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print)	1550-4786

Other

Other	1st IEEE International Conference on Data Mining, ICDM'01
Country/Territory	United States
City	San Jose, CA
Period	11/29/01 → 12/2/01

Cite this

@inproceedings{f5b72ba10fbd4ad29126e4f8a96e2e93,

title = "A scalable algorithm for clustering sequential data",

abstract = "In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and web-logs have an inherent sequential nature. Clustering of such data sets is useful for various purposes. For example, clustering of sequences from commercial data sets may help marketers identify different customer groups based upon their purchasing patterns. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Over the years, many methods have been developed for clustering objects according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experiments using data sets derived from sequences of purchasing transactions and protein sequences show that this approach is scalable and leads to reasonably good clusters.",

author = "Valerie Guralnik and George Karypis",

year = "2001",

language = "English (US)",

isbn = "0769511198",

series = "Proceedings - IEEE International Conference on Data Mining, ICDM",

pages = "179--186",

booktitle = "Proceedings - 2001 IEEE International Conference on Data Mining, ICDM'01",

note = "1st IEEE International Conference on Data Mining, ICDM'01 ; Conference date: 29-11-2001 Through 02-12-2001",

}

TY - GEN

T1 - A scalable algorithm for clustering sequential data

AU - Guralnik, Valerie

AU - Karypis, George

PY - 2001

Y1 - 2001

N2 - In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and web-logs have an inherent sequential nature. Clustering of such data sets is useful for various purposes. For example, clustering of sequences from commercial data sets may help marketers identify different customer groups based upon their purchasing patterns. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Over the years, many methods have been developed for clustering objects according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experiments using data sets derived from sequences of purchasing transactions and protein sequences show that this approach is scalable and leads to reasonably good clusters.

AB - In recent years, we have seen an enormous growth in the amount of available commercial and scientific data. Data from domains such as protein sequences, retail transactions, intrusion detection, and web-logs have an inherent sequential nature. Clustering of such data sets is useful for various purposes. For example, clustering of sequences from commercial data sets may help marketers identify different customer groups based upon their purchasing patterns. Grouping protein sequences that share similar structure helps in identifying sequences with similar functionality. Over the years, many methods have been developed for clustering objects according to their similarity. However, these methods tend to have a computational complexity that is at least quadratic in the number of sequences. In this paper we present an entirely different approach to sequence clustering that does not require an all-against-all analysis and uses a near-linear complexity K-means based clustering algorithm. Our experiments using data sets derived from sequences of purchasing transactions and protein sequences show that this approach is scalable and leads to reasonably good clusters.

UR - http://www.scopus.com/inward/record.url?scp=33751181595&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=33751181595&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:33751181595

SN - 0769511198

SN - 9780769511191

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 179

EP - 186

BT - Proceedings - 2001 IEEE International Conference on Data Mining, ICDM'01

T2 - 1st IEEE International Conference on Data Mining, ICDM'01

Y2 - 29 November 2001 through 2 December 2001

ER -
