This paper considers the use of data prefetching and an alternative mechanism, data forwarding, to reduce the memory latency caused by interprocessor communication in cache-coherent, shared-memory multiprocessors. Data prefetching is accomplished with a multiprocessor software-pipelined algorithm. Data forwarding targets interprocessor data communication, rather than synchronization, and is applied to communication-related accesses between successive parallel loops. Prefetching and forwarding are each shown to be more effective for certain architectural and application characteristics. Given this result, a new hybrid prefetching-and-forwarding approach is proposed and evaluated that allows the relative amounts of prefetching and forwarding to be adapted to these characteristics. Compared with prefetching or forwarding alone, the hybrid scheme is shown to increase performance stability across varying application characteristics; to reduce processor instruction overhead, cache miss ratios, and memory system bandwidth requirements; and to reduce performance sensitivity to architectural parameters such as cache size. Algorithms for data prefetching, data forwarding, and hybrid prefetching and forwarding are described. These algorithms are applied using a parallelizing compiler and are evaluated via execution-driven simulations of large, optimized numerical application codes with loop-level and vector parallelism.
Bibliographical note

Funding Information: This work was supported in part by the National Science Foundation under Grants NSF MIP 93–07910 and NSF MIP 89–20891, the Department of Energy under Grant DOE DE FG02–85ER25001, the National Security Agency, and an IBM Resident Study Fellowship. This work was performed while the authors were with the Center for Supercomputing Research and Development, University of Illinois at Urbana–Champaign, Urbana, IL 61801.