HPC formulations of optimization algorithms for tensor completion

Shaden Smith; Jongsoo Park; George Karypis

doi:10.1016/j.parco.2017.11.002


Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.
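To make the factorization idea in the abstract concrete, here is a minimal serial sketch of low-rank tensor completion with stochastic gradient descent. It is not the parallel formulation studied in the paper: the function names (`predict`, `rmse`, `sgd_complete`), the toy rank-1 tensor, and the hyperparameters (`rank`, `lr`, `reg`, `epochs`) are all assumptions chosen for illustration, and the code uses only the Python standard library.

```python
import random

def predict(factors, idx):
    """CP model: sum over rank components of the product of one factor entry per mode."""
    rank = len(factors[0][0])
    total = 0.0
    for r in range(rank):
        prod = 1.0
        for mode, i in enumerate(idx):
            prod *= factors[mode][i][r]
        total += prod
    return total

def rmse(factors, entries):
    """Root-mean-square error over a list of (index tuple, value) pairs."""
    sq = sum((val - predict(factors, idx)) ** 2 for idx, val in entries)
    return (sq / len(entries)) ** 0.5

def sgd_complete(entries, dims, rank=2, lr=0.05, reg=1e-3, epochs=200, seed=0):
    """Fit one factor matrix per mode to the observed entries only (illustrative settings)."""
    rng = random.Random(seed)
    factors = [[[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(d)]
               for d in dims]
    order = list(entries)
    for _ in range(epochs):
        rng.shuffle(order)  # visit observed entries in random order each epoch
        for idx, val in order:
            err = val - predict(factors, idx)
            # Snapshot the rows touched by this entry so every mode's update
            # uses the same pre-update values.
            old = [list(factors[m][i]) for m, i in enumerate(idx)]
            for m, i in enumerate(idx):
                for r in range(rank):
                    others = 1.0
                    for n in range(len(idx)):
                        if n != m:
                            others *= old[n][r]
                    # Gradient step on squared error with L2 regularization.
                    factors[m][i][r] += lr * (err * others - reg * old[m][r])
    return factors

# Toy data (assumed): a 3 x 2 x 2 rank-1 tensor with one entry held out as "missing".
a, b, c = [0.5, 1.0, 1.5], [1.0, 0.5], [1.0, 0.8]
full = [((i, j, k), a[i] * b[j] * c[k])
        for i in range(3) for j in range(2) for k in range(2)]
held_out, observed = full[-1], full[:-1]
dims = (3, 2, 2)

before = rmse(sgd_complete(observed, dims, epochs=0), observed)  # random initialization
factors = sgd_complete(observed, dims)
after = rmse(factors, observed)
print(f"training RMSE: {before:.4f} -> {after:.4f}")
print(f"held-out {held_out[0]}: true {held_out[1]:.3f}, "
      f"predicted {predict(factors, held_out[0]):.3f}")
```

Each update touches only the factor rows indexed by one observed entry. That locality is the property that stratified or asynchronous parallel SGD schemes generally rely on, although this sketch makes no attempt at parallelism.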

Original language	English (US)
Pages (from-to)	99-117
Number of pages	19
Journal	Parallel Computing
Volume	74
DOIs	https://doi.org/10.1016/j.parco.2017.11.002
State	Published - May 2018

Bibliographical note

Funding Information:
This work is an extended and revised version of a preliminary conference paper [32]. The authors would like to thank Mikhail Smelyanskiy for valuable discussions, Jeff Hammond for generously donating computing time at NERSC, and Karlsson et al. for sharing source code used for evaluation. This work was supported in part by NSF (IIS-0905220, OCI-1048018, CNS-1162405, IIS-1247632, IIP-1414153, IIS-1447788), Army Research Office (W911NF-14-1-0316), a University of Minnesota Doctoral Dissertation Fellowship, Intel Software and Services Group, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.

Publisher Copyright:
© 2017

Keywords

Factorization
Machine learning
Recommender system
Sparse tensor
Tensor completion
Unstructured algorithm


Cite this

@article{824574a6e05c493889405a31a5b450f5,

title = "HPC formulations of optimization algorithms for tensor completion",

abstract = "Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.",

keywords = "Factorization, Machine learning, Recommender system, Sparse tensor, Tensor completion, Unstructured algorithm",

author = "Shaden Smith and Jongsoo Park and George Karypis",

note = "Funding Information: This work is an extended and revised version of a preliminary conference paper [32]. The authors would like to thank Mikhail Smelyanskiy for valuable discussions, Jeff Hammond for generously donating computing time at NERSC, and Karlsson et al. for sharing source code used for evaluation. This work was supported in part by NSF (IIS-0905220, OCI-1048018, CNS-1162405, IIS-1247632, IIP-1414153, IIS-1447788), Army Research Office (W911NF-14-1-0316), a University of Minnesota Doctoral Dissertation Fellowship, Intel Software and Services Group, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Publisher Copyright: {\textcopyright} 2017",

year = "2018",

month = may,

doi = "10.1016/j.parco.2017.11.002",

language = "English (US)",

volume = "74",

pages = "99--117",

journal = "Parallel Computing",

issn = "0167-8191",

publisher = "Elsevier",

}

TY - JOUR

T1 - HPC formulations of optimization algorithms for tensor completion

AU - Smith, Shaden

AU - Park, Jongsoo

AU - Karypis, George

N1 - Funding Information: This work is an extended and revised version of a preliminary conference paper [32]. The authors would like to thank Mikhail Smelyanskiy for valuable discussions, Jeff Hammond for generously donating computing time at NERSC, and Karlsson et al. for sharing source code used for evaluation. This work was supported in part by NSF (IIS-0905220, OCI-1048018, CNS-1162405, IIS-1247632, IIP-1414153, IIS-1447788), Army Research Office (W911NF-14-1-0316), a University of Minnesota Doctoral Dissertation Fellowship, Intel Software and Services Group, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute. This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. Publisher Copyright: © 2017

PY - 2018/5

Y1 - 2018/5

N2 - Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.

AB - Tensor completion is a powerful tool used to estimate or recover missing values in multi-way data. It has seen great success in domains such as product recommendation and healthcare. Tensor completion is most often accomplished via low-rank sparse tensor factorization, a computationally expensive non-convex optimization problem which has only recently been studied in the context of parallel computing. In this work, we study three optimization algorithms that have been successfully applied to tensor completion: alternating least squares (ALS), stochastic gradient descent (SGD), and coordinate descent (CCD++). We explore opportunities for parallelism on shared- and distributed-memory systems and address challenges such as memory- and operation-efficiency, load balance, cache locality, and communication. Among our advancements are a communication-efficient CCD++ algorithm, an ALS algorithm rich in level-3 BLAS routines, and an SGD algorithm which combines stratification with asynchronous communication. Furthermore, we show that introducing randomization during ALS and CCD++ can accelerate convergence. We evaluate our parallel formulations on a variety of real datasets on a modern supercomputer and demonstrate speedups through 16384 cores. These improvements reduce time-to-solution from hours to seconds on real-world datasets. We show that after our optimizations, ALS is advantageous on parallel systems of small-to-moderate scale, while both ALS and CCD++ provide the lowest time-to-solution on large-scale distributed systems.

KW - Factorization

KW - Machine learning

KW - Recommender system

KW - Sparse tensor

KW - Tensor completion

KW - Unstructured algorithm

UR - http://www.scopus.com/inward/record.url?scp=85034455895&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85034455895&partnerID=8YFLogxK

U2 - 10.1016/j.parco.2017.11.002

DO - 10.1016/j.parco.2017.11.002

M3 - Article

AN - SCOPUS:85034455895

SN - 0167-8191

VL - 74

SP - 99

EP - 117

JO - Parallel Computing

JF - Parallel Computing

ER -
