A high-bandwidth memory pipeline for wide issue processors

Sangyeun Cho; Pen Chung Yew; Gyungho Lee

doi:10.1109/12.936237

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

8 Scopus citations

Abstract

Providing adequate data bandwidth is extremely important for a future wide-issue processor to achieve its full performance potential. Adding a large number of ports to a data cache, however, becomes increasingly inefficient and can add significantly to hardware complexity. This paper takes an alternative or complementary approach to providing more data bandwidth, called data decoupling. In particular, it studies an interesting, yet less explored, behavior of memory access instructions called access region locality, which concerns each static memory instruction and the range of locations it accesses at runtime. Our experimental study using a set of SPEC95 benchmark programs shows that most memory access instructions reference a single region at runtime. It also shows that the access region of a memory instruction can be predicted accurately at runtime by examining the instruction's addressing mode and its past access history. We describe and evaluate a wide-issue superscalar processor with two distinct sets of memory pipelines and caches, driven by the access region predictor. Experimental results indicate that the proposed mechanism is very effective in providing high memory bandwidth to the processor, yielding performance comparable to or better than that of a conventional memory design with a heavily multiported data cache, which can entail much higher hardware complexity.
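The prediction idea in the abstract (addressing mode plus past access history, per static instruction) can be illustrated with a small sketch. This is not the authors' design: the two regions ("stack" and "non-stack", suggested by the "Runtime stack" keyword), the register names, the stack address bounds, and the last-region table are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

STACK, NON_STACK = "stack", "non_stack"

# Hypothetical register names and stack bounds, chosen only for illustration.
STACK_BASE_REGS = {"sp", "fp"}
STACK_LOW, STACK_HIGH = 0x7FFF0000, 0x80000000


def region_of(address: int) -> str:
    """Classify a resolved address using the assumed stack bounds."""
    return STACK if STACK_LOW <= address < STACK_HIGH else NON_STACK


@dataclass
class AccessRegionPredictor:
    """Table keyed by instruction address (pc) holding the last region accessed."""
    table: dict = field(default_factory=dict)

    def predict(self, pc: int, base_reg: str) -> str:
        # Addressing-mode hint: a stack- or frame-pointer base implies the stack.
        if base_reg in STACK_BASE_REGS:
            return STACK
        # Otherwise use history; unseen instructions default to non-stack.
        return self.table.get(pc, NON_STACK)

    def update(self, pc: int, address: int) -> None:
        # Record the region actually accessed once the address is resolved.
        self.table[pc] = region_of(address)


def accuracy(trace) -> float:
    """Replay (pc, base_reg, address) records and return the fraction predicted correctly."""
    predictor = AccessRegionPredictor()
    correct = 0
    for pc, base_reg, address in trace:
        if predictor.predict(pc, base_reg) == region_of(address):
            correct += 1
        predictor.update(pc, address)
    return correct / len(trace)
```

In a design like the one the abstract describes, the predicted region would steer each memory instruction to one of the two memory pipelines before its address is known; a misprediction would have to be detected once the address resolves. How the paper handles that recovery is not stated in the abstract, so it is left out of the sketch.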

Original language	English (US)
Pages (from-to)	709-723
Number of pages	15
Journal	IEEE Transactions on Computers
Volume	50
Issue number	7
DOIs	https://doi.org/10.1109/12.936237
State	Published - Jul 2001

Bibliographical note

Funding Information:
This work was supported in part by the US National Science Foundation under grant numbers MIP-9610379 and CDA-9502979; by the US Army Intelligence Center and Fort Huachuca under contract DABT63-95-C-0127 and ARPA order number D346; and by a gift from the Intel Corporation. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the US Army Intelligence Center and Fort Huachuca or the US government. The authors thank Todd Austin and Doug Burger for developing the SimpleScalar tool set, which has been indispensable for our study. The anonymous referees provided many constructive comments that greatly helped improve the quality of this paper. Sangyeun Cho was supported in part by a doctoral fellowship from the Korea Foundation for Advanced Studies (KFAS). Gyungho Lee was supported in part by the US National Science Foundation under grant number CCR-0073259 and by Alpha Processor, Inc.

Keywords

Data bandwidth
Data locality
Data stream partitioning
Instruction level parallelism
Multiported data cache
Runtime stack

Cite this

@article{34f90cc714024b519d974f94d494353a,

title = "A high-bandwidth memory pipeline for wide issue processors",

keywords = "Data bandwidth, Data locality, Data stream partitioning, Instruction level parallelism, Multiported data cache, Runtime stack",

author = "Sangyeun Cho and Yew, {Pen Chung} and Gyungho Lee",

year = "2001",

month = jul,

doi = "10.1109/12.936237",

language = "English (US)",

volume = "50",

pages = "709--723",

journal = "IEEE Transactions on Computers",

issn = "0018-9340",

publisher = "IEEE Computer Society",

number = "7",

}

TY - JOUR

T1 - A high-bandwidth memory pipeline for wide issue processors

AU - Cho, Sangyeun

AU - Yew, Pen Chung

AU - Lee, Gyungho

PY - 2001/7

Y1 - 2001/7

KW - Data bandwidth

KW - Data locality

KW - Data stream partitioning

KW - Instruction level parallelism

KW - Multiported data cache

KW - Runtime stack

UR - http://www.scopus.com/inward/record.url?scp=0035390811&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0035390811&partnerID=8YFLogxK

U2 - 10.1109/12.936237

DO - 10.1109/12.936237

M3 - Article

AN - SCOPUS:0035390811

SN - 0018-9340

VL - 50

SP - 709

EP - 723

JO - IEEE Transactions on Computers

JF - IEEE Transactions on Computers

IS - 7

ER -
