TY - JOUR
T1 - Data prefetching and data forwarding in shared memory multiprocessors
AU - Poulsen, D. K.
AU - Yew, Pen Chung
PY - 1994
Y1 - 1994
AB - This paper studies and compares the use of data prefetching and an alternative mechanism, data forwarding, for reducing memory latency due to interprocessor communication in cache coherent, shared memory multiprocessors. Two multiprocessor prefetching algorithms are presented and compared. A simple blocked vector prefetching algorithm, considerably less complex than existing software pipelined prefetching algorithms, is shown to be effective in reducing memory latency and increasing performance. A Forwarding Write operation is used to evaluate the effectiveness of forwarding. The use of data forwarding results in significant performance improvements over data prefetching for codes exhibiting less spatial locality. Algorithms for data prefetching and data forwarding are implemented in a parallelizing compiler. Evaluation of the proposed schemes and algorithms is accomplished via execution-driven simulation of large, optimized, parallel numerical application codes with loop-level and vector parallelism. More data, discussion, and experiment details can be found in [1].
UR - http://www.scopus.com/inward/record.url?scp=77954460854&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954460854&partnerID=8YFLogxK
U2 - 10.1109/ICPP.1994.81
DO - 10.1109/ICPP.1994.81
M3 - Conference article
AN - SCOPUS:77954460854
SN - 0190-3918
VL - 2
SP - II276-II280
JO - Proceedings of the International Conference on Parallel Processing
JF - Proceedings of the International Conference on Parallel Processing
M1 - 5727799
T2 - 23rd International Conference on Parallel Processing, ICPP 1994
Y2 - 15 August 1994 through 19 August 1994
ER -