TY - GEN
T1 - Multi-stage coordinated prefetching for present-day processors
AU - Mehta, Sanyam
AU - Fang, Zhenman
AU - Zhai, Antonia
AU - Yew, Pen Chung
PY - 2014
Y1 - 2014
N2 - Data prefetching is an important technique for hiding memory latency. Latest microarchitectures provide support for both hardware and software prefetching. However, the architectural features supporting either are different. In addition, these features can vary from one architecture to another. As a result, the choice of the right prefetching strategy is non-trivial for both the programmers and compiler-writers. In this paper, we study different prefetching techniques in the context of different architectural features that support prefetching on existing hardware platforms. These features include, the size of the line fill buffer or the Miss Status Handling Registers servicing prefetch requests at each level of cache, the aggressiveness and effectiveness of the hardware prefetchers, interaction between software prefetch requests and the hardware prefetcher, the nature of the instruction pipeline (in-order/out-of-order execution), etc. Our experiments with two widely different processors, a latest multi-core (SandyBridge) and a many-core (Xeon Phi) processor, show that these architectural features have a significant bearing on the prefetching choice in a given source program, so much so that the best prefetching technique on SandyBridge performs worst on Xeon Phi and vice-versa. Based on our study of the interaction between the host architecture and prefetching, we find that coordinated multi-stage prefetching that brings data closer to the core in stages, yields best performance. On SandyBridge, the mid-level cache hardware prefetcher and L1 software prefetching coordinate to achieve this end, whereas on Xeon Phi, pure software prefetching proves adequate. We implement our algorithm in the ROSE source-to-source compiler framework. Experimental results demonstrate that coordinated prefetching achieves a speed-up (geometric mean over benchmarks from the SPEC suite) of 1.55X and 1.3X against the hardware prefetcher and the Intel compiler, respectively, on Xeon Phi. On SandyBridge, a speed-up of 1.08X is obtained over its effective hardware prefetcher.
AB - Data prefetching is an important technique for hiding memory latency. Latest microarchitectures provide support for both hardware and software prefetching. However, the architectural features supporting either are different. In addition, these features can vary from one architecture to another. As a result, the choice of the right prefetching strategy is non-trivial for both the programmers and compiler-writers. In this paper, we study different prefetching techniques in the context of different architectural features that support prefetching on existing hardware platforms. These features include, the size of the line fill buffer or the Miss Status Handling Registers servicing prefetch requests at each level of cache, the aggressiveness and effectiveness of the hardware prefetchers, interaction between software prefetch requests and the hardware prefetcher, the nature of the instruction pipeline (in-order/out-of-order execution), etc. Our experiments with two widely different processors, a latest multi-core (SandyBridge) and a many-core (Xeon Phi) processor, show that these architectural features have a significant bearing on the prefetching choice in a given source program, so much so that the best prefetching technique on SandyBridge performs worst on Xeon Phi and vice-versa. Based on our study of the interaction between the host architecture and prefetching, we find that coordinated multi-stage prefetching that brings data closer to the core in stages, yields best performance. On SandyBridge, the mid-level cache hardware prefetcher and L1 software prefetching coordinate to achieve this end, whereas on Xeon Phi, pure software prefetching proves adequate. We implement our algorithm in the ROSE source-to-source compiler framework. Experimental results demonstrate that coordinated prefetching achieves a speed-up (geometric mean over benchmarks from the SPEC suite) of 1.55X and 1.3X against the hardware prefetcher and the Intel compiler, respectively, on Xeon Phi. On SandyBridge, a speed-up of 1.08X is obtained over its effective hardware prefetcher.
KW - coordinated prefetching
KW - sandybridge
KW - xeonphi
UR - http://www.scopus.com/inward/record.url?scp=84903761716&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84903761716&partnerID=8YFLogxK
U2 - 10.1145/2597652.2597660
DO - 10.1145/2597652.2597660
M3 - Conference contribution
AN - SCOPUS:84903761716
SN - 9781450326421
T3 - Proceedings of the International Conference on Supercomputing
SP - 73
EP - 82
BT - ICS 2014 - Proceedings of the 28th ACM International Conference on Supercomputing
PB - Association for Computing Machinery
T2 - 28th ACM International Conference on Supercomputing, ICS 2014
Y2 - 10 June 2014 through 13 June 2014
ER -