TurboTiling: Leveraging prefetching to boost performance of tiled codes

Sanyam Mehta; Rajat Garg; Nishad Trivedi; Pen Chung Yew

doi:10.1145/2925426.2926288

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

13 Scopus citations

Abstract

Loop tiling, or blocking, improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse distance, it also has an often-ignored side effect: it breaks the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance. In this work, we propose a tiling algorithm that leverages prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache rather than for the higher levels of cache, as is generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable to prefetching, but also reduces off-chip traffic to memory (and therefore scales better with the number of cores). As a result, although we tile for the last-level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. We also propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache of modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives a significant improvement over existing tiling algorithms that target the higher-level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, our approach achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.
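To make the idea in the abstract concrete, the following is a minimal sketch of a matrix multiplication in which only the outer loops are tiled, so that the innermost loop still walks whole rows with unit stride (the kind of streaming pattern a hardware prefetcher can follow). It is an illustration only, not the paper's algorithm: the function names, the loop order, and the helper `rows_fitting_cache` are hypothetical, and the helper uses a naive capacity-only rule that ignores the conflict misses the paper's tile-size selection is designed to minimize.

```python
def matmul_naive(A, B, n):
    """Reference i-k-j matrix multiply on n x n lists of lists."""
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            a = A[i][k]
            for j in range(n):
                C[i][j] += a * B[k][j]
    return C


def matmul_tiled(A, B, n, ti, tk):
    """Tile only the i and k loops (tile sizes ti and tk).

    A block of tk rows of B is reused across ti rows of A, while the
    innermost j loop still streams over full rows of B and C.
    """
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, ti):
        for kk in range(0, n, tk):
            for i in range(ii, min(ii + ti, n)):
                row_c = C[i]
                for k in range(kk, min(kk + tk, n)):
                    a = A[i][k]
                    row_b = B[k]
                    for j in range(n):  # untiled: unit-stride stream
                        row_c[j] += a * row_b[j]
    return C


def rows_fitting_cache(cache_bytes, n, elem_bytes=8, fraction=0.5):
    """Hypothetical capacity-only heuristic (an assumption, not the paper's).

    Returns how many full rows of n elements fit in the given fraction
    of the cache, clamped to the range 1..n.
    """
    rows = int(cache_bytes * fraction) // (n * elem_bytes)
    return max(1, min(rows, n))


if __name__ == "__main__":
    n = 64
    A = [[float(i + j) for j in range(n)] for i in range(n)]
    B = [[float(i - j) for j in range(n)] for i in range(n)]
    # 20 MiB is an assumed last-level cache size used only for illustration.
    tk = rows_fitting_cache(20 * 1024 * 1024, n)
    print(matmul_tiled(A, B, n, ti=16, tk=tk) == matmul_naive(A, B, n))
```

Because the tiled version visits `k` in the same ascending order for every output element, it performs the same additions in the same order as the reference and returns identical results; only the memory access order changes. In an interpreted language the cache effect itself is not observable, so the sketch shows the loop structure rather than the speedup reported in the abstract.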

Original language	English (US)
Title of host publication	Proceedings of the 2016 International Conference on Supercomputing, ICS 2016
Publisher	Association for Computing Machinery
ISBN (Electronic)	9781450343619
DOIs	https://doi.org/10.1145/2925426.2926288
State	Published - Jun 1 2016
Event	30th International Conference on Supercomputing, ICS 2016 - Istanbul, Turkey
Duration	Jun 1 2016 → Jun 3 2016

Publication series

Name	Proceedings of the International Conference on Supercomputing
Volume	01-03-June-2016

Other

Other	30th International Conference on Supercomputing, ICS 2016
Country/Territory	Turkey
City	Istanbul
Period	6/1/16 → 6/3/16

Bibliographical note

Publisher Copyright:
© 2016 ACM.

Keywords

Loop tiling
Multi-core
Prefetching

Cite this

Mehta, S., Garg, R., Trivedi, N., & Yew, P. C. (2016). TurboTiling: Leveraging prefetching to boost performance of tiled codes. In Proceedings of the 2016 International Conference on Supercomputing, ICS 2016 Article a38 (Proceedings of the International Conference on Supercomputing; Vol. 01-03-June-2016). Association for Computing Machinery. https://doi.org/10.1145/2925426.2926288

@inproceedings{40df0070d751462fa9297a00350d1cd9,

title = "TurboTiling: Leveraging prefetching to boost performance of tiled codes",

keywords = "Loop tiling, Multi-core, Prefetching",

author = "Sanyam Mehta and Rajat Garg and Nishad Trivedi and Yew, {Pen Chung}",

note = "Publisher Copyright: {\textcopyright} 2016 ACM.; 30th International Conference on Supercomputing, ICS 2016 ; Conference date: 01-06-2016 Through 03-06-2016",

year = "2016",

month = jun,

day = "1",

doi = "10.1145/2925426.2926288",

language = "English (US)",

series = "Proceedings of the International Conference on Supercomputing",

publisher = "Association for Computing Machinery",

booktitle = "Proceedings of the 2016 International Conference on Supercomputing, ICS 2016",

}

TY - GEN

T1 - TurboTiling

T2 - 30th International Conference on Supercomputing, ICS 2016

AU - Mehta, Sanyam

AU - Garg, Rajat

AU - Trivedi, Nishad

AU - Yew, Pen Chung

PY - 2016/6/1

Y1 - 2016/6/1

KW - Loop tiling

KW - Multi-core

KW - Prefetching

UR - http://www.scopus.com/inward/record.url?scp=84978513059&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84978513059&partnerID=8YFLogxK

U2 - 10.1145/2925426.2926288

DO - 10.1145/2925426.2926288

M3 - Conference contribution

AN - SCOPUS:84978513059

T3 - Proceedings of the International Conference on Supercomputing

BT - Proceedings of the 2016 International Conference on Supercomputing, ICS 2016

PB - Association for Computing Machinery

Y2 - 1 June 2016 through 3 June 2016

ER -
