TY - GEN
T1 - TurboTiling
T2 - 30th International Conference on Supercomputing, ICS 2016
AU - Mehta, Sanyam
AU - Garg, Rajat
AU - Trivedi, Nishad
AU - Yew, Pen Chung
N1 - Publisher Copyright:
© 2016 ACM.
Copyright:
Copyright 2017 Elsevier B.V., All rights reserved.
PY - 2016/6/1
Y1 - 2016/6/1
N2 - Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse, it also leads to an often ignored side-effect: breaking the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance. In this work, we propose a tiling algorithm to leverage prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable for prefetching, but also allows for a reduction in the off-chip traffic to memory (and therefore, better scaling with the number of cores). As a result, although we tile for the last level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To achieve this, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache in modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives significant improvement over existing tiling algorithms that target higher level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.
AB - Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse, it also leads to an often ignored side-effect: breaking the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance. In this work, we propose a tiling algorithm to leverage prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable for prefetching, but also allows for a reduction in the off-chip traffic to memory (and therefore, better scaling with the number of cores). As a result, although we tile for the last level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To achieve this, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache in modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives significant improvement over existing tiling algorithms that target higher level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.
KW - Loop tiling
KW - Multi-core
KW - Prefetching
UR - http://www.scopus.com/inward/record.url?scp=84978513059&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84978513059&partnerID=8YFLogxK
U2 - 10.1145/2925426.2926288
DO - 10.1145/2925426.2926288
M3 - Conference contribution
AN - SCOPUS:84978513059
T3 - Proceedings of the International Conference on Supercomputing
BT - Proceedings of the 2016 International Conference on Supercomputing, ICS 2016
PB - Association for Computing Machinery
Y2 - 1 June 2016 through 3 June 2016
ER -