TY - GEN
T1 - A divide-and-conquer approach for solving singular value decomposition on a heterogeneous system
AU - Liu, Ding
AU - Li, Ruixuan
AU - Lilja, David J.
AU - Xiao, Weijun
PY - 2013
Y1 - 2013
N2 - Singular value decomposition (SVD) is a fundamental linear operation that has been used for many applications, such as pattern recognition and statistical information processing. In order to accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution to the secular equation with good numerical stability, overlapping the CPU and the GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm has better performance than MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device, when the size of the matrix grows up to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
AB - Singular value decomposition (SVD) is a fundamental linear operation that has been used for many applications, such as pattern recognition and statistical information processing. In order to accelerate this time-consuming operation, this paper presents a new divide-and-conquer approach for solving SVD on a heterogeneous CPU-GPU system. We carefully design our algorithm to match the mathematical requirements of SVD to the unique characteristics of a heterogeneous computing platform. This includes a high-performance solution to the secular equation with good numerical stability, overlapping the CPU and the GPU tasks, and leveraging the GPU bandwidth in a heterogeneous system. The experimental results show that our algorithm has better performance than MKL's divide-and-conquer routine [18] with four cores (eight hardware threads) when the size of the input matrix is larger than 3000. Furthermore, it is up to 33 times faster than LAPACK's divide-and-conquer routine [17], 3 times faster than MKL's divide-and-conquer routine with four cores, and 7 times faster than CULA on the same device, when the size of the matrix grows up to 14,000. Our algorithm is also much faster than previous SVD approaches on GPUs.
KW - Divide-and-conquer
KW - Heterogeneous architecture
KW - Performance evaluation
KW - Singular value decomposition (SVD)
UR - http://www.scopus.com/inward/record.url?scp=84879529598&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84879529598&partnerID=8YFLogxK
U2 - 10.1145/2482767.2482813
DO - 10.1145/2482767.2482813
M3 - Conference contribution
AN - SCOPUS:84879529598
SN - 9781450320535
T3 - Proceedings of the ACM International Conference on Computing Frontiers, CF 2013
BT - Proceedings of the ACM International Conference on Computing Frontiers, CF 2013
T2 - 2013 ACM International Conference on Computing Frontiers, CF 2013
Y2 - 14 May 2013 through 16 May 2013
ER -