Streaming tensor factorization for infinite data sources

Shaden Smith, Kejun Huang, Nicholas D. Sidiropoulos, George Karypis

Research output: Contribution to conference › Paper › peer-review

30 Scopus citations

Abstract

Sparse tensor factorization is a popular tool in multi-way data analysis and is used in applications such as cybersecurity, recommender systems, and social network analysis. In many of these applications, the tensor is not known a priori and instead arrives in a streaming fashion for a potentially unbounded amount of time. Existing approaches for streaming sparse tensors are not practical for unbounded streaming because they rely on maintaining the full factorization of the data, which grows linearly with time. In this work, we present CP-stream, an algorithm for streaming factorization in the model of the canonical polyadic decomposition which does not grow linearly in time or space, and is thus practical for long-term streaming. Additionally, CP-stream incorporates user-specified constraints such as non-negativity which aid in the stability and interpretability of the factorization. An evaluation of CP-stream demonstrates that it converges faster than state-of-the-art streaming algorithms while achieving lower reconstruction error by an order of magnitude. We also evaluate it on real-world sparse datasets and demonstrate its usability in both network traffic analysis and discussion tracking. Our evaluation uses exclusively public datasets and our source code is released to the public as part of SPLATT, an open source high-performance tensor factorization toolkit.
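As a rough illustration of the model the abstract refers to, the sketch below rebuilds one time slice of a three-way tensor from canonical polyadic (CP) factors and measures its relative reconstruction error. It is not CP-stream, whose update rules are not described in this record, and it does not use SPLATT. The names `cp_slice`, `relative_error`, `A`, `B`, and `s_t` are hypothetical, and the data is random and dense for brevity.

```python
import numpy as np


def cp_slice(A, B, s):
    """Rebuild one time slice of a 3-way CP model.

    A (I x R) and B (J x R) are the non-temporal factor matrices and
    s (length R) is the temporal factor row for this slice, so the
    slice equals sum_r s[r] * outer(A[:, r], B[:, r]).
    """
    return (A * s) @ B.T


def relative_error(X, X_hat):
    """Frobenius-norm reconstruction error, relative to the norm of X."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)


# Illustrative random factors (assumed values, not data from the paper).
rng = np.random.default_rng(0)
rank = 3
A = rng.random((5, rank))   # non-negative by construction
B = rng.random((4, rank))
s_t = rng.random(rank)

# An exact rank-3 slice, then a noisy observation of it.
X_t = cp_slice(A, B, s_t)
noisy = X_t + 0.01 * rng.standard_normal(X_t.shape)

err = relative_error(noisy, cp_slice(A, B, s_t))
print(f"relative reconstruction error: {err:.4f}")
```

In this picture each new batch of streamed data contributes one more slice and one more temporal row `s_t`, which is why storing every such row makes the factorization grow with time.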

Original language: English (US)
Pages: 81-89
Number of pages: 9
State: Published - 2018
Event: 2018 SIAM International Conference on Data Mining, SDM 2018 - San Diego, United States
Duration: May 3 2018 → May 5 2018

Other

Other: 2018 SIAM International Conference on Data Mining, SDM 2018
Country/Territory: United States
City: San Diego
Period: 5/3/18 → 5/5/18

Bibliographical note

Funding Information:
Shaden Smith was located at the University of Minnesota during the majority of this work. This work was supported in part by NSF (IIS-0905220, OCI-1048018, CNS-1162405, IIS-1247632, IIP-1414153, IIS-1447788), Army Research Office (W911NF-14-1-0316), a University of Minnesota Doctoral Dissertation Fellowship, Intel Software and Services Group, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Publisher Copyright:
© 2018 by SIAM.
