Efficient and distributed algorithms for large-scale generalized canonical correlations analysis

Xiao Fu; Kejun Huang; Evangelos E. Papalexakis; Hyun Ah Song; Partha Pratim Talukdar; Nicholas D. Sidiropoulos; Christos Faloutsos; Tom Mitchell

doi:10.1109/ICDM.2016.78

Electrical and Computer Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Scopus citations

Abstract

Generalized canonical correlation analysis (GCCA) aims at extracting common structure from multiple 'views', i.e., high-dimensional matrices representing the same objects in different feature domains; it is an extension of classical two-view CCA. Existing (G)CCA algorithms have serious scalability issues, since they involve square root factorization of the correlation matrices of the views. The memory and computational complexity associated with this step grow as a quadratic and cubic function of the problem dimension (the number of samples / features), respectively. To circumvent such difficulties, we propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000, while the current approaches can only handle thousands of features / samples. Our second contribution is a distributed algorithm for GCCA, which computes the canonical components of different views in parallel and thus can further reduce the runtime significantly (by ≥ 30% in experiments) if multiple cores are available. Judiciously designed synthetic and real-data experiments using a multilingual dataset are employed to showcase the effectiveness of the proposed algorithms.
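
The scalability bottleneck described in the abstract can be illustrated with a small dense baseline. The sketch below assumes the MAX-VAR formulation of GCCA, which the abstract does not name explicitly; the function name `dense_maxvar_gcca`, the ridge term `reg`, and the NumPy implementation are illustrative assumptions. It shows the kind of classical dense approach the paper sets out to replace, not the proposed algorithm.

```python
import numpy as np


def dense_maxvar_gcca(views, k, reg=1e-8):
    """Naive dense GCCA baseline (MAX-VAR form assumed, for illustration only).

    views: list of (n_samples, n_features_i) arrays describing the same objects.
    k:     number of shared canonical components to extract.
    reg:   small ridge term (an assumption) that keeps each Gram matrix invertible.
    """
    n = views[0].shape[0]
    m = np.zeros((n, n))  # dense n x n accumulator: memory grows quadratically in n
    solved = []
    for x in views:
        gram = x.T @ x + reg * np.eye(x.shape[1])
        w = np.linalg.solve(gram, x.T)  # (features_i, n): regularized pseudo-inverse of x
        solved.append(w)
        m += x @ w  # (ridge-regularized) projector onto the column space of this view
    m = (m + m.T) / 2  # remove round-off asymmetry before the symmetric eigensolver
    _, eigvecs = np.linalg.eigh(m)  # full eigendecomposition: time grows cubically in n
    g = eigvecs[:, -k:]  # eigh sorts eigenvalues ascending, so take the last k columns
    qs = [w @ g for w in solved]  # per-view canonical components
    return g, qs
```

With n samples, the accumulator `m` stores n^2 entries and the eigendecomposition costs on the order of n^3 operations, which is the same kind of quadratic-memory, cubic-time growth that the abstract attributes to existing methods. Views with 100,000 samples would need an n x n matrix of roughly 10^10 entries under this baseline.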

Original language	English (US)
Title of host publication	Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
Editors	Francesco Bonchi, Josep Domingo-Ferrer, Ricardo Baeza-Yates, Zhi-Hua Zhou, Xindong Wu
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	871-876
Number of pages	6
ISBN (Electronic)	9781509054725
DOIs	https://doi.org/10.1109/ICDM.2016.78
State	Published - Jul 2 2016
Event	16th IEEE International Conference on Data Mining, ICDM 2016 - Barcelona, Catalonia, Spain Duration: Dec 12 2016 → Dec 15 2016

Publication series

Name	Proceedings - IEEE International Conference on Data Mining, ICDM
Volume	0
ISSN (Print)	1550-4786

Bibliographical note

Publisher Copyright:
© 2016 IEEE.

Keywords

Distributed GCCA
Large-scale generalized canonical correlation analysis
Multilingual word embeddings

Cite this

Fu, X., Huang, K., Papalexakis, E. E., Song, H. A., Talukdar, P. P., Sidiropoulos, N. D., Faloutsos, C., & Mitchell, T. (2016). Efficient and distributed algorithms for large-scale generalized canonical correlations analysis. In F. Bonchi, J. Domingo-Ferrer, R. Baeza-Yates, Z.-H. Zhou, & X. Wu (Eds.), Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016 (pp. 871-876). Article 7837918 (Proceedings - IEEE International Conference on Data Mining, ICDM; Vol. 0). Institute of Electrical and Electronics Engineers Inc.. https://doi.org/10.1109/ICDM.2016.78

@inproceedings{8b37a807ae0a44b68ae3142fdf2a6b06,

title = "Efficient and distributed algorithms for large-scale generalized canonical correlations analysis",

abstract = "Generalized canonical correlation analysis (GCCA) aims at extracting common structure from multiple 'views', i.e., high-dimensional matrices representing the same objects in different feature domains; it is an extension of classical two-view CCA. Existing (G)CCA algorithms have serious scalability issues, since they involve square root factorization of the correlation matrices of the views. The memory and computational complexity associated with this step grow as a quadratic and cubic function of the problem dimension (the number of samples / features), respectively. To circumvent such difficulties, we propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000, while the current approaches can only handle thousands of features / samples. Our second contribution is a distributed algorithm for GCCA, which computes the canonical components of different views in parallel and thus can further reduce the runtime significantly (by ≥ 30% in experiments) if multiple cores are available. Judiciously designed synthetic and real-data experiments using a multilingual dataset are employed to showcase the effectiveness of the proposed algorithms.",

keywords = "Distributed GCCA, Large-scale generalized canonical correlation analysis, Multilingual word embeddings",

author = "Xiao Fu and Kejun Huang and Papalexakis, {Evangelos E.} and Song, {Hyun Ah} and Talukdar, {Partha Pratim} and Sidiropoulos, {Nicholas D.} and Christos Faloutsos and Tom Mitchell",

note = "Publisher Copyright: {\textcopyright} 2016 IEEE.; 16th IEEE International Conference on Data Mining, ICDM 2016 ; Conference date: 12-12-2016 Through 15-12-2016",

year = "2016",

month = jul,

day = "2",

doi = "10.1109/ICDM.2016.78",

language = "English (US)",

series = "Proceedings - IEEE International Conference on Data Mining, ICDM",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "871--876",

editor = "Francesco Bonchi and Josep Domingo-Ferrer and Ricardo Baeza-Yates and Zhi-Hua Zhou and Xindong Wu",

booktitle = "Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016",

}

TY - GEN

T1 - Efficient and distributed algorithms for large-scale generalized canonical correlations analysis

AU - Fu, Xiao

AU - Huang, Kejun

AU - Papalexakis, Evangelos E.

AU - Song, Hyun Ah

AU - Talukdar, Partha Pratim

AU - Sidiropoulos, Nicholas D.

AU - Faloutsos, Christos

AU - Mitchell, Tom

PY - 2016/7/2

Y1 - 2016/7/2

N2 - Generalized canonical correlation analysis (GCCA) aims at extracting common structure from multiple 'views', i.e., high-dimensional matrices representing the same objects in different feature domains; it is an extension of classical two-view CCA. Existing (G)CCA algorithms have serious scalability issues, since they involve square root factorization of the correlation matrices of the views. The memory and computational complexity associated with this step grow as a quadratic and cubic function of the problem dimension (the number of samples / features), respectively. To circumvent such difficulties, we propose a GCCA algorithm whose memory and computational costs scale linearly in the problem dimension and the number of nonzero data elements, respectively. Consequently, the proposed algorithm can easily handle very large sparse views whose sample and feature dimensions both exceed 100,000, while the current approaches can only handle thousands of features / samples. Our second contribution is a distributed algorithm for GCCA, which computes the canonical components of different views in parallel and thus can further reduce the runtime significantly (by ≥ 30% in experiments) if multiple cores are available. Judiciously designed synthetic and real-data experiments using a multilingual dataset are employed to showcase the effectiveness of the proposed algorithms.

KW - Distributed GCCA

KW - Large-scale generalized canonical correlation analysis

KW - Multilingual word embeddings

UR - http://www.scopus.com/inward/record.url?scp=85014518269&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85014518269&partnerID=8YFLogxK

U2 - 10.1109/ICDM.2016.78

DO - 10.1109/ICDM.2016.78

M3 - Conference contribution

AN - SCOPUS:85014518269

T3 - Proceedings - IEEE International Conference on Data Mining, ICDM

SP - 871

EP - 876

BT - Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016

A2 - Bonchi, Francesco

A2 - Domingo-Ferrer, Josep

A2 - Baeza-Yates, Ricardo

A2 - Zhou, Zhi-Hua

A2 - Wu, Xindong

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 16th IEEE International Conference on Data Mining, ICDM 2016

Y2 - 12 December 2016 through 15 December 2016

ER -
