Hybrid Binary-Unary Hardware Accelerator

S. Rasoul Faraji; Kia Bazargan

doi:10.1109/TC.2020.2971596

Electrical and Computer Engineering

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Stream-based computing such as stochastic computing has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the area saving comes at an exponential price in latency, making the area × delay cost unattractive. In this article, we present a novel method which uses a hybrid binary/unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back to compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or close-to-optimal design in the design space. To the best of our knowledge, we are the first to show a scalable method based on parallel bit-stream data representation that can beat conventional binary in terms of a real cost, i.e., area × delay and energy consumption, in almost all functions that we tried at resolutions of 8, 10, and 12 bits. Our method outperforms the binary, stochastic, and fully unary methods on a number of functions, especially low-cost binary CORDIC-based functions, and on a common edge detection algorithm on FPGA and in ASIC implementation. In terms of area × delay cost, our {on FPGA, in ASIC} cost is on average only {4.72%, 24.36%} and {20.16%, 60.12%} of the parallel binary pipeline implementation at 8- and 10-bit resolution, respectively. These numbers are 2-3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the parallel CORDIC-based pipeline binary method for high-resolution (12-bit), highly oscillating functions such as sin(15x). However, for complex functions like the gamma function, the proposed method can beat any other method in terms of area × delay, throughput, latency, and energy per sample costs.
To implement the Roberts cross edge detection algorithm, the proposed method takes 5.7 and 39.45 percent of the area × delay cost of FPGA and ASIC implementation of the binary method, respectively. In terms of energy efficiency for FPGA implementation, our method uses only 8.4, 12.7, and 27.7 percent of the energy per sample usage of serial binary implementations at 8-, 10-, and 12-bit resolutions, respectively. These numbers change to 23.9, 38.54, and 99.3 percent compared to parallel binary implementations.
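
The three steps named in the abstract (split the input range into sub-regions, compute in unary inside each sub-region, pack the result back to binary) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design or synthesis flow: the function names (`thermometer`, `build_region`, `hybrid_eval`), the choice of four equal sub-regions, and the example function x²/255 are all hypothetical, and the model handles only functions that are non-decreasing within each sub-region.

```python
def thermometer(value, width):
    """Unary (thermometer) code: wire i is 1 exactly when i < value."""
    return [1 if i < value else 0 for i in range(width)]


def build_region(f, lo, hi):
    """Derive a bias and a wire routing for the sub-region [lo, hi).

    Assumption of this sketch: f is non-decreasing on the sub-region,
    so every unary output wire can be driven by a single input wire.
    """
    values = [f(x) for x in range(lo, hi)]
    assert values == sorted(values), "sketch handles monotonic sub-regions only"
    bias = values[0]                 # binary offset added after the unary core
    span = values[-1] - bias         # number of unary output wires needed
    routing = []
    for j in range(span):
        # Smallest local offset at which the output exceeds bias + j.
        t = next(k for k, v in enumerate(values) if v - bias > j)
        # Output wire j is high when offset >= t, i.e. when input wire t-1 is high.
        routing.append(t - 1)
    return bias, routing


def hybrid_eval(x, regions, region_size):
    """Binary region select, unary core, then pack back to binary."""
    region, offset = divmod(x, region_size)         # upper bits / lower bits
    bias, routing = regions[region]
    wires_in = thermometer(offset, region_size - 1)
    wires_out = [wires_in[src] for src in routing]  # wiring only, no arithmetic
    return bias + sum(wires_out)                    # count of ones plus bias


# Illustrative parameters (assumed, not taken from the paper).
N_BITS, REGION_BITS = 8, 2
SIZE = 1 << (N_BITS - REGION_BITS)                  # 64 inputs per sub-region


def f(x):
    return (x * x) // 255                           # example 8-bit function


regions = [build_region(f, r * SIZE, (r + 1) * SIZE)
           for r in range(1 << REGION_BITS)]
```

In this model a fully unary 8-bit input would need 255 thermometer wires, while each of the four sub-regions needs only 63. That reduction in unary width is the intuition behind the hybrid representation; the area, delay, and energy figures quoted in the abstract come from the authors' FPGA and ASIC implementations, not from a model like this one.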

Original language	English (US)
Article number	8981875
Pages (from-to)	1308-1319
Number of pages	12
Journal	IEEE Transactions on Computers
Volume	69
Issue number	9
DOIs	https://doi.org/10.1109/TC.2020.2971596
State	Published - Sep 1 2020

Bibliographical note

Publisher Copyright:
© 1968-2012 IEEE.

Keywords

CORDIC
Hybrid computing system
alternator logic
edge detection
hardware accelerators
scaling network
stochastic computing
unary computing system

Cite this

@article{0c08365873034e8c81290626f045b99a,
  title = "Hybrid Binary-Unary Hardware Accelerator",
  abstract = "Stream-based computing such as stochastic computing has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the area saving comes at an exponential price in latency, making the area × delay cost unattractive. In this article, we present a novel method which uses a hybrid binary/unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back to compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or close-to-optimal design in the design space. To the best of our knowledge, we are the first to show a scalable method based on parallel bit-stream data representation that can beat conventional binary in terms of a real cost, i.e., area × delay and energy consumption, in almost all functions that we tried at resolutions of 8, 10, and 12 bits. Our method outperforms the binary, stochastic, and fully unary methods on a number of functions, especially low-cost binary CORDIC-based functions, and on a common edge detection algorithm on FPGA and in ASIC implementation. In terms of area × delay cost, our {on FPGA, in ASIC} cost is on average only {4.72\%, 24.36\%} and {20.16\%, 60.12\%} of the parallel binary pipeline implementation at 8- and 10-bit resolution, respectively. These numbers are 2-3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the parallel CORDIC-based pipeline binary method for high-resolution (12-bit), highly oscillating functions such as $\sin(15x)$. However, for complex functions like the gamma function, the proposed method can beat any other method in terms of area × delay, throughput, latency, and energy per sample costs. To implement the Roberts cross edge detection algorithm, the proposed method takes 5.7 and 39.45 percent of the area × delay cost of FPGA and ASIC implementation of the binary method, respectively. In terms of energy efficiency for FPGA implementation, our method uses only 8.4, 12.7, and 27.7 percent of the energy per sample usage of serial binary implementations at 8-, 10-, and 12-bit resolutions, respectively. These numbers change to 23.9, 38.54, and 99.3 percent compared to parallel binary implementations.",
  keywords = "CORDIC, Hybrid computing system, alternator logic, edge detection, hardware accelerators, scaling network, stochastic computing, unary computing system",
  author = "Faraji, {S. Rasoul} and Bazargan, Kia",
  note = "Publisher Copyright: {\textcopyright} 1968-2012 IEEE.",
  year = "2020",
  month = sep,
  day = "1",
  doi = "10.1109/TC.2020.2971596",
  language = "English (US)",
  volume = "69",
  pages = "1308--1319",
  journal = "IEEE Transactions on Computers",
  issn = "0018-9340",
  publisher = "IEEE Computer Society",
  number = "9",
}

TY  - JOUR
T1  - Hybrid Binary-Unary Hardware Accelerator
AU  - Faraji, S. Rasoul
AU  - Bazargan, Kia
PY  - 2020/9/1
Y1  - 2020/9/1
AB  - Stream-based computing such as stochastic computing has been used in recent years to create designs with significantly smaller area by harnessing unary encoding of data. However, the area saving comes at an exponential price in latency, making the area × delay cost unattractive. In this article, we present a novel method which uses a hybrid binary/unary representation to perform computations. We first divide the input range into a few sub-regions, perform unary computations on each sub-region individually, and finally pack the outputs of all sub-regions back to compact binary. Moreover, we propose a synthesis methodology and a regression model to predict an optimal or close-to-optimal design in the design space. To the best of our knowledge, we are the first to show a scalable method based on parallel bit-stream data representation that can beat conventional binary in terms of a real cost, i.e., area × delay and energy consumption, in almost all functions that we tried at resolutions of 8, 10, and 12 bits. Our method outperforms the binary, stochastic, and fully unary methods on a number of functions, especially low-cost binary CORDIC-based functions, and on a common edge detection algorithm on FPGA and in ASIC implementation. In terms of area × delay cost, our {on FPGA, in ASIC} cost is on average only {4.72%, 24.36%} and {20.16%, 60.12%} of the parallel binary pipeline implementation at 8- and 10-bit resolution, respectively. These numbers are 2-3 orders of magnitude better than the results of traditional stochastic methods. Our method is not competitive with the parallel CORDIC-based pipeline binary method for high-resolution (12-bit), highly oscillating functions such as sin(15x). However, for complex functions like the gamma function, the proposed method can beat any other method in terms of area × delay, throughput, latency, and energy per sample costs. To implement the Roberts cross edge detection algorithm, the proposed method takes 5.7 and 39.45 percent of the area × delay cost of FPGA and ASIC implementation of the binary method, respectively. In terms of energy efficiency for FPGA implementation, our method uses only 8.4, 12.7, and 27.7 percent of the energy per sample usage of serial binary implementations at 8-, 10-, and 12-bit resolutions, respectively. These numbers change to 23.9, 38.54, and 99.3 percent compared to parallel binary implementations.
KW  - CORDIC
KW  - Hybrid computing system
KW  - alternator logic
KW  - edge detection
KW  - hardware accelerators
KW  - scaling network
KW  - stochastic computing
KW  - unary computing system
UR  - http://www.scopus.com/inward/record.url?scp=85089506144&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=85089506144&partnerID=8YFLogxK
U2  - 10.1109/TC.2020.2971596
DO  - 10.1109/TC.2020.2971596
M3  - Article
AN  - SCOPUS:85089506144
SN  - 0018-9340
VL  - 69
SP  - 1308
EP  - 1319
JO  - IEEE Transactions on Computers
JF  - IEEE Transactions on Computers
IS  - 9
M1  - 8981875
ER  -
