Abstract
The K-means clustering problem seeks to partition the columns of a data matrix into subsets, such that columns in the same subset are 'close' to each other. The co-clustering problem seeks to simultaneously partition the rows and columns of a matrix to produce 'coherent' groups called co-clusters. Co-clustering has recently found numerous applications in diverse areas. The concept readily generalizes to higher-way datasets (e.g., adding a temporal dimension). Starting from K-means, we show how co-clustering can be formulated as constrained multilinear decomposition with sparse latent factors. In the case of three- and higher-way data, this corresponds to a PARAFAC decomposition with sparse latent factors. This is important because PARAFAC is unique under mild conditions, and sparsity further improves identifiability. This allows us to uniquely unravel a large number of possibly overlapping co-clusters that are hidden in the data. Interestingly, the imposition of latent sparsity pays a collateral dividend: as one increases the number of fitted co-clusters, new co-clusters are added without affecting those previously extracted. An important corollary is that co-clusters can be extracted incrementally; this implies that the algorithm scales well for large datasets. We demonstrate the validity of our approach using the ENRON corpus, as well as synthetic data.
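To make the idea of sparse latent factors and incremental extraction concrete, the following is a minimal two-way sketch, not the authors' algorithm. It assumes a nonnegative data matrix, fits one rank-one term `a b^T` at a time by alternating nonnegative soft-thresholded least-squares updates, and subtracts each fitted term before looking for the next. The function names (`shrink`, `sparse_rank_one`, `extract_coclusters`), the penalty weight `lam`, and the toy matrix are all hypothetical choices made for illustration; the three-way case described in the abstract would use a PARAFAC model with a third factor instead.

```python
import numpy as np


def shrink(v, lam):
    """Nonnegative soft-thresholding: subtract lam, then clip at zero."""
    return np.maximum(v - lam, 0.0)


def sparse_rank_one(X, lam, n_iter=100):
    """Fit X ~ a b^T with sparse, nonnegative a and b.

    Each update is the exact minimiser of
    0.5 * ||X - a b^T||_F^2 + lam * (||a||_1 + ||b||_1)
    over one factor (constrained to be nonnegative) with the other fixed.
    """
    a = np.zeros(X.shape[0])
    b = np.ones(X.shape[1])
    for _ in range(n_iter):
        a = shrink(X @ b, lam) / (b @ b)
        if not a.any():  # everything was thresholded away
            return a, np.zeros_like(b)
        b = shrink(X.T @ a, lam) / (a @ a)
        if not b.any():
            return np.zeros_like(a), b
    return a, b


def extract_coclusters(X, n_coclusters, lam):
    """Extract co-clusters one at a time, subtracting each fitted term."""
    residual = X.astype(float).copy()
    found = []
    for _ in range(n_coclusters):
        a, b = sparse_rank_one(residual, lam)
        if not a.any():
            break
        # The supports of a and b give the rows and columns of the co-cluster.
        found.append((np.flatnonzero(a), np.flatnonzero(b)))
        residual -= np.outer(a, b)
    return found


# Toy matrix (an assumption for illustration) with two planted blocks.
X = np.zeros((6, 8))
X[0:3, 0:4] = 1.0
X[3:6, 4:8] = 2.0

coclusters = extract_coclusters(X, n_coclusters=2, lam=0.5)
for rows, cols in coclusters:
    print("rows", rows.tolist(), "cols", cols.tolist())
```

In this sketch the stronger block is recovered first and the weaker one second, and the second extraction leaves the first co-cluster's row and column sets untouched, which is the behaviour the abstract describes for incremental extraction.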
Original language | English (US) |
---|---|
Title of host publication | 2011 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Proceedings |
Pages | 2064-2067 |
Number of pages | 4 |
State | Published - Aug 18 2011 |
Event | 36th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011 - Prague, Czech Republic; Duration: May 22 2011 → May 27 2011 |