Multi-stage coordinated prefetching for present-day processors

Sanyam Mehta; Zhenman Fang; Antonia Zhai; Pen Chung Yew

doi:10.1145/2597652.2597660

Computer Science and Engineering

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

20 Scopus citations

Abstract

Data prefetching is an important technique for hiding memory latency. The latest microarchitectures support both hardware and software prefetching. However, the architectural features that support each are different, and these features can vary from one architecture to another. As a result, choosing the right prefetching strategy is non-trivial for both programmers and compiler writers. In this paper, we study different prefetching techniques in the context of the architectural features that support prefetching on existing hardware platforms. These features include the size of the line fill buffer or the Miss Status Handling Registers servicing prefetch requests at each level of cache, the aggressiveness and effectiveness of the hardware prefetchers, the interaction between software prefetch requests and the hardware prefetcher, and the nature of the instruction pipeline (in-order or out-of-order execution). Our experiments with two widely different processors, a recent multi-core processor (SandyBridge) and a many-core processor (Xeon Phi), show that these architectural features have a significant bearing on the prefetching choice in a given source program, so much so that the best prefetching technique on SandyBridge performs worst on Xeon Phi and vice versa. Based on our study of the interaction between the host architecture and prefetching, we find that coordinated multi-stage prefetching, which brings data closer to the core in stages, yields the best performance. On SandyBridge, the mid-level cache hardware prefetcher and L1 software prefetching coordinate to achieve this end, whereas on Xeon Phi, pure software prefetching proves adequate. We implement our algorithm in the ROSE source-to-source compiler framework. Experimental results demonstrate that coordinated prefetching achieves a speed-up (geometric mean over benchmarks from the SPEC suite) of 1.55X and 1.3X against the hardware prefetcher and the Intel compiler, respectively, on Xeon Phi. On SandyBridge, a speed-up of 1.08X is obtained over its effective hardware prefetcher.
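
The idea of bringing data closer to the core in stages can be illustrated with a small, pure-software sketch in C. It is not the authors' algorithm or the code their ROSE-based tool generates. It only shows the general shape of two-stage software prefetching in a streaming loop. The function name `sum_two_stage`, the distances `L2_DIST` and `L1_DIST`, and the 64-byte line size are assumptions chosen for illustration, and the sketch relies on the GCC/Clang builtin `__builtin_prefetch`, whose locality hint maps to cache levels in a target-dependent way.

```c
#include <stddef.h>

/* Hypothetical prefetch distances, in elements. Real values must be tuned
   per machine; these are not taken from the paper. */
#define L2_DIST    256 /* far stage: request the line for an outer cache level */
#define L1_DIST    32  /* near stage: request the line for the L1 cache */
#define LINE_ELEMS 8   /* doubles per cache line, assuming 64-byte lines */

/* Sums an array while issuing two software prefetches per cache line:
   a far one with a lower locality hint and a near one with the highest. */
double sum_two_stage(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* Issue at most one prefetch pair per cache line of input. */
        if (i % LINE_ELEMS == 0) {
            /* Bounds checks keep the computed addresses inside the array. */
            if (i + L2_DIST < n)
                __builtin_prefetch(&a[i + L2_DIST], 0, 2); /* read, moderate locality */
            if (i + L1_DIST < n)
                __builtin_prefetch(&a[i + L1_DIST], 0, 3); /* read, high locality */
        }
        sum += a[i];
    }
    return sum;
}
```

Calling `sum_two_stage(a, n)` returns the same value as a plain summation loop, since prefetches are only hints and do not change program results. On a platform where a hardware prefetcher already covers the far stage, as the abstract reports for the mid-level cache on SandyBridge, the first `__builtin_prefetch` call would presumably be dropped and only the near-stage one kept.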

Original language	English (US)
Title of host publication	ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing
Publisher	Association for Computing Machinery
Pages	73-82
Number of pages	10
ISBN (Print)	9781450326421
DOIs	https://doi.org/10.1145/2597652.2597660
State	Published - 2014
Event	28th ACM International Conference on Supercomputing, ICS 2014 - Munich, Germany; Duration: Jun 10 2014 → Jun 13 2014

Publication series

Name	Proceedings of the International Conference on Supercomputing

Other

Other	28th ACM International Conference on Supercomputing, ICS 2014
Country/Territory	Germany
City	Munich
Period	6/10/14 → 6/13/14

Keywords

coordinated prefetching
sandybridge
xeonphi

Cite this

Multi-stage coordinated prefetching for present-day processors. / Mehta, Sanyam; Fang, Zhenman; Zhai, Antonia et al.
ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing. Association for Computing Machinery, 2014. p. 73-82 (Proceedings of the International Conference on Supercomputing).

Mehta, S, Fang, Z, Zhai, A & Yew, PC 2014, Multi-stage coordinated prefetching for present-day processors. in ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing. Proceedings of the International Conference on Supercomputing, Association for Computing Machinery, pp. 73-82, 28th ACM International Conference on Supercomputing, ICS 2014, Munich, Germany, 6/10/14. https://doi.org/10.1145/2597652.2597660

@inproceedings{32de567912d64e7f8ba7d71c35772bc0,

title = "Multi-stage coordinated prefetching for present-day processors",

keywords = "coordinated prefetching, sandybridge, xeonphi",

author = "Sanyam Mehta and Zhenman Fang and Antonia Zhai and Yew, {Pen Chung}",

year = "2014",

doi = "10.1145/2597652.2597660",

language = "English (US)",

isbn = "9781450326421",

series = "Proceedings of the International Conference on Supercomputing",

publisher = "Association for Computing Machinery",

pages = "73--82",

booktitle = "ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing",

note = "28th ACM International Conference on Supercomputing, ICS 2014 ; Conference date: 10-06-2014 Through 13-06-2014",

}

TY - GEN

T1 - Multi-stage coordinated prefetching for present-day processors

AU - Mehta, Sanyam

AU - Fang, Zhenman

AU - Zhai, Antonia

AU - Yew, Pen Chung

PY - 2014

Y1 - 2014

KW - coordinated prefetching

KW - sandybridge

KW - xeonphi

UR - http://www.scopus.com/inward/record.url?scp=84903761716&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84903761716&partnerID=8YFLogxK

U2 - 10.1145/2597652.2597660

DO - 10.1145/2597652.2597660

M3 - Conference contribution

AN - SCOPUS:84903761716

SN - 9781450326421

T3 - Proceedings of the International Conference on Supercomputing

SP - 73

EP - 82

BT - ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing

PB - Association for Computing Machinery

T2 - 28th ACM International Conference on Supercomputing, ICS 2014

Y2 - 10 June 2014 through 13 June 2014

ER -
