Code transformations for enhancing the performance of speculatively parallel threads

Shengyue Wang; Pen Chung Yew; Antonia Zhai

doi:10.1142/S0218126612400087

Computer Science and Engineering

Research output: Contribution to journal › Article › peer-review

5 Scopus citations

Abstract

As technology advances, microprocessors that integrate multiple cores on a single chip are becoming increasingly common. How to use these processors to improve the performance of a single program has been a challenge. For general-purpose applications, it is especially difficult to create efficient parallel execution because of complex control flow and ambiguous data dependences. Thread-level speculation and transactional memory provide two hardware mechanisms that are able to optimistically parallelize potentially dependent threads. However, a compiler that performs detailed performance trade-off analysis is essential for generating efficient parallel programs for such hardware. This compiler must be able to take into consideration the cost of intra-thread as well as inter-thread value communication. On the other hand, the ubiquitous existence of complex, input-dependent control flow and data dependence patterns in general-purpose applications makes it impossible to have one technique optimize all program patterns. In this paper, we propose three optimization techniques to improve thread performance: (i) scheduling instructions and generating recovery code to reduce the critical forwarding path introduced by synchronizing memory-resident values; (ii) identifying reduction variables and transforming the code to minimize serialized execution; and (iii) dynamically merging consecutive iterations of a loop to avoid stalls due to unbalanced workload. Detailed evaluation of the proposed techniques shows that each optimization improves a subset of the SPEC2000 benchmarks, but none improves all of them. On average, the proposed optimizations improve performance by 7% for the set of SPEC2000 benchmarks that have already been optimized for register-resident value communication.
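
As a rough illustration of technique (ii), the sketch below contrasts a loop that accumulates into a single shared variable with a version that gives each thread a private partial sum and combines the partials once at the end. It is not taken from the paper: the function names, the `num_threads` parameter, and the round-robin assignment of iterations to threads are assumptions made for illustration, and the threads are simulated sequentially rather than run speculatively.

```python
from typing import List, Sequence


def sum_serial(values: Sequence[int]) -> int:
    """Original loop: every iteration reads and writes the shared `total`,
    so threads running consecutive iterations would serialize on it."""
    total = 0
    for v in values:
        total += v
    return total


def sum_with_private_partials(values: Sequence[int], num_threads: int = 4) -> int:
    """Transformed loop (hypothetical sketch): iteration i accumulates into
    the private partial of thread i % num_threads, and the partials are
    combined once after the loop instead of on every iteration."""
    partials: List[int] = [0] * num_threads
    for i, v in enumerate(values):
        partials[i % num_threads] += v
    return sum(partials)
```

Both functions return the same result for an associative, commutative operation such as integer addition; the transformed version only removes the cross-iteration dependence on a single shared accumulator.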

Original language	English (US)
Article number	1240008
Journal	Journal of Circuits, Systems and Computers
Volume	21
Issue number	2
DOIs	https://doi.org/10.1142/S0218126612400087
State	Published - Apr 2012

Bibliographical note

Funding Information:
This work is supported in part by grants from the National Science Foundation under CNS-0834599, CSR-0834599, and CPS-0931931, a contract from the Semiconductor Research Corporation under SRC-2008-TJ-1819, and gift grants from HP, IBM, and Intel.

Keywords

Thread-level speculation
compiler optimizations
multicore systems
parallelizing compiler

Cite this

@article{dee5e806aaa9457095a4d18379e0fb3d,

title = "Code transformations for enhancing the performance of speculatively parallel threads",

abstract = "As technology advances, microprocessors that integrate multiple cores on a single chip are becoming increasingly common. How to use these processors to improve the performance of a single program has been a challenge. For general-purpose applications, it is especially difficult to create efficient parallel execution because of complex control flow and ambiguous data dependences. Thread-level speculation and transactional memory provide two hardware mechanisms that are able to optimistically parallelize potentially dependent threads. However, a compiler that performs detailed performance trade-off analysis is essential for generating efficient parallel programs for such hardware. This compiler must be able to take into consideration the cost of intra-thread as well as inter-thread value communication. On the other hand, the ubiquitous existence of complex, input-dependent control flow and data dependence patterns in general-purpose applications makes it impossible to have one technique optimize all program patterns. In this paper, we propose three optimization techniques to improve thread performance: (i) scheduling instructions and generating recovery code to reduce the critical forwarding path introduced by synchronizing memory-resident values; (ii) identifying reduction variables and transforming the code to minimize serialized execution; and (iii) dynamically merging consecutive iterations of a loop to avoid stalls due to unbalanced workload. Detailed evaluation of the proposed techniques shows that each optimization improves a subset of the SPEC2000 benchmarks, but none improves all of them. On average, the proposed optimizations improve performance by 7% for the set of SPEC2000 benchmarks that have already been optimized for register-resident value communication.",

keywords = "Thread-level speculation, compiler optimizations, multicore systems, parallelizing compiler",

author = "Shengyue Wang and Yew, {Pen Chung} and Antonia Zhai",

note = "Funding Information: This work is supported in part by grants from National Science Foundation under CNS-0834599, CSR-0834599, and CPS-0931931, a contract from Semiconductor Research Corporation under SRC-2008-TJ-1819, and gift grants from HP, IBM, and Intel.",

year = "2012",

month = apr,

doi = "10.1142/S0218126612400087",

language = "English (US)",

volume = "21",

journal = "Journal of Circuits, Systems and Computers",

issn = "0218-1266",

publisher = "World Scientific Publishing Co. Pte Ltd",

number = "2",

}

TY - JOUR

T1 - Code transformations for enhancing the performance of speculatively parallel threads

AU - Wang, Shengyue

AU - Yew, Pen Chung

AU - Zhai, Antonia

N1 - Funding Information: This work is supported in part by grants from National Science Foundation under CNS-0834599, CSR-0834599, and CPS-0931931, a contract from Semiconductor Research Corporation under SRC-2008-TJ-1819, and gift grants from HP, IBM, and Intel.

PY - 2012/4

Y1 - 2012/4

N2 - As technology advances, microprocessors that integrate multiple cores on a single chip are becoming increasingly common. How to use these processors to improve the performance of a single program has been a challenge. For general-purpose applications, it is especially difficult to create efficient parallel execution because of complex control flow and ambiguous data dependences. Thread-level speculation and transactional memory provide two hardware mechanisms that are able to optimistically parallelize potentially dependent threads. However, a compiler that performs detailed performance trade-off analysis is essential for generating efficient parallel programs for such hardware. This compiler must be able to take into consideration the cost of intra-thread as well as inter-thread value communication. On the other hand, the ubiquitous existence of complex, input-dependent control flow and data dependence patterns in general-purpose applications makes it impossible to have one technique optimize all program patterns. In this paper, we propose three optimization techniques to improve thread performance: (i) scheduling instructions and generating recovery code to reduce the critical forwarding path introduced by synchronizing memory-resident values; (ii) identifying reduction variables and transforming the code to minimize serialized execution; and (iii) dynamically merging consecutive iterations of a loop to avoid stalls due to unbalanced workload. Detailed evaluation of the proposed techniques shows that each optimization improves a subset of the SPEC2000 benchmarks, but none improves all of them. On average, the proposed optimizations improve performance by 7% for the set of SPEC2000 benchmarks that have already been optimized for register-resident value communication.

AB - As technology advances, microprocessors that integrate multiple cores on a single chip are becoming increasingly common. How to use these processors to improve the performance of a single program has been a challenge. For general-purpose applications, it is especially difficult to create efficient parallel execution because of complex control flow and ambiguous data dependences. Thread-level speculation and transactional memory provide two hardware mechanisms that are able to optimistically parallelize potentially dependent threads. However, a compiler that performs detailed performance trade-off analysis is essential for generating efficient parallel programs for such hardware. This compiler must be able to take into consideration the cost of intra-thread as well as inter-thread value communication. On the other hand, the ubiquitous existence of complex, input-dependent control flow and data dependence patterns in general-purpose applications makes it impossible to have one technique optimize all program patterns. In this paper, we propose three optimization techniques to improve thread performance: (i) scheduling instructions and generating recovery code to reduce the critical forwarding path introduced by synchronizing memory-resident values; (ii) identifying reduction variables and transforming the code to minimize serialized execution; and (iii) dynamically merging consecutive iterations of a loop to avoid stalls due to unbalanced workload. Detailed evaluation of the proposed techniques shows that each optimization improves a subset of the SPEC2000 benchmarks, but none improves all of them. On average, the proposed optimizations improve performance by 7% for the set of SPEC2000 benchmarks that have already been optimized for register-resident value communication.

KW - Thread-level speculation

KW - compiler optimizations

KW - multicore systems

KW - parallelizing compiler

UR - http://www.scopus.com/inward/record.url?scp=84862224136&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84862224136&partnerID=8YFLogxK

U2 - 10.1142/S0218126612400087

DO - 10.1142/S0218126612400087

M3 - Article

AN - SCOPUS:84862224136

SN - 0218-1266

VL - 21

JO - Journal of Circuits, Systems and Computers

JF - Journal of Circuits, Systems and Computers

IS - 2

M1 - 1240008

ER -
