H-PS: A Heterogeneous-Aware Parameter Server with Distributed Neural Network Training

Lintao Xian; Bingzhe Li; Jing Liu; Zhongwen Guo; David H.C. Du

doi:10.1109/ACCESS.2021.3060154

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Deep neural networks have become one of the most popular techniques in many research and application areas, including computer vision and natural language processing. As the complexity of neural networks continues to increase, the training process takes much longer and requires more computation resources. To speed up training, a centralized distributed training structure named Parameter Server (PS) is widely used to assign training tasks to different workers/nodes. Most existing studies assume that all workers have the same computation resources. However, in a heterogeneous environment, fast workers (i.e., workers with more computation resources) complete tasks earlier than slow workers, and thus the system does not fully utilize the resources of the fast workers. In this paper, we propose a PS model with heterogeneous types of workers/nodes, called H-PS, which fully utilizes the resources of each worker by dynamically scheduling tasks based on the current status of the workers (e.g., available memory). As a result, the workers complete their tasks at the same time and stragglers (i.e., workers that fall behind others) are avoided. In addition, a pipeline scheme is proposed to further improve the effectiveness of workers by fully utilizing their resources while parameters are being transmitted between the PS and the workers. Moreover, a flexible quantization scheme is proposed to reduce the communication overhead between the PS and the workers. Finally, H-PS is implemented using containers, an emerging lightweight technology. The experimental results indicate that the proposed H-PS reduces the overall training time by 1.4x to 3.5x compared with existing methods.

Original language	English (US)
Article number	9356607
Pages (from-to)	44049-44058
Number of pages	10
Journal	IEEE Access
Volume	9
DOIs	https://doi.org/10.1109/ACCESS.2021.3060154
State	Published - 2021
Externally published	Yes

Bibliographical note

Publisher Copyright:
© 2013 IEEE.

Keywords

Distributed machine learning (DML)
dynamic quantization parameter
dynamically scheduling tasks
heterogeneous environments
pipeline communication and computation

Cite this

@article{3a53c194b34b43b5bff4085de211069c,
title = "H-PS: A Heterogeneous-Aware Parameter Server with Distributed Neural Network Training",
abstract = "Deep neural networks have become one of the popular techniques used in many research and application areas including computer vision, natural language processing, etc. As the complexity of neural networks continuously increasing, the training process takes a much longer time and requires more computation resources. To speed up the training process, a centralized distributed training structure named Parameter Server (PS) is widely used to assign training tasks to different workers/nodes. Most existing studies considered all workers having the same computation resources. However, in a heterogeneous environment, fast workers (i.e., workers having more computation resources) can complete tasks earlier than slow workers and thus the system does not fully utilize the resources of fast workers. In this paper, we propose a PS model with heterogeneous types of workers/nodes, called H-PS, which can fully utilize the resources of each worker by dynamically scheduling tasks based on the current status of the workers (e.g., available memory). By doing so, the workers will complete their tasks at the same time and the stragglers (i.e., workers fall behind others) can be avoided. In addition, a pipeline scheme is proposed to further improve the effectiveness of workers by fully utilizing the resources of workers during the time of parameters transmitting between PS and workers. Moreover, a flexible quantization scheme is proposed to reduce the communication overhead between the PS and workers. Finally, the H-PS is implemented using Containers which is an emerging lightweight technology. The experimental results indicate that the proposed H-PS can reduce the overall training time by 1.4x - 3.5x when compared with existing methods.",
keywords = "Distributed machine learning (DML), dynamic quantization parameter, dynamically scheduling tasks, heterogeneous environments, pipeline communication and computation",
author = "Lintao Xian and Bingzhe Li and Jing Liu and Zhongwen Guo and Du, {David H.C.}",
note = "Publisher Copyright: {\textcopyright} 2013 IEEE.",
year = "2021",
doi = "10.1109/ACCESS.2021.3060154",
language = "English (US)",
volume = "9",
pages = "44049--44058",
journal = "IEEE Access",
issn = "2169-3536",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
}

TY - JOUR
T1 - H-PS
T2 - A Heterogeneous-Aware Parameter Server with Distributed Neural Network Training
AU - Xian, Lintao
AU - Li, Bingzhe
AU - Liu, Jing
AU - Guo, Zhongwen
AU - Du, David H.C.
PY - 2021
Y1 - 2021
AB - Deep neural networks have become one of the popular techniques used in many research and application areas including computer vision, natural language processing, etc. As the complexity of neural networks continuously increasing, the training process takes a much longer time and requires more computation resources. To speed up the training process, a centralized distributed training structure named Parameter Server (PS) is widely used to assign training tasks to different workers/nodes. Most existing studies considered all workers having the same computation resources. However, in a heterogeneous environment, fast workers (i.e., workers having more computation resources) can complete tasks earlier than slow workers and thus the system does not fully utilize the resources of fast workers. In this paper, we propose a PS model with heterogeneous types of workers/nodes, called H-PS, which can fully utilize the resources of each worker by dynamically scheduling tasks based on the current status of the workers (e.g., available memory). By doing so, the workers will complete their tasks at the same time and the stragglers (i.e., workers fall behind others) can be avoided. In addition, a pipeline scheme is proposed to further improve the effectiveness of workers by fully utilizing the resources of workers during the time of parameters transmitting between PS and workers. Moreover, a flexible quantization scheme is proposed to reduce the communication overhead between the PS and workers. Finally, the H-PS is implemented using Containers which is an emerging lightweight technology. The experimental results indicate that the proposed H-PS can reduce the overall training time by 1.4x - 3.5x when compared with existing methods.
KW - Distributed machine learning (DML)
KW - dynamic quantization parameter
KW - dynamically scheduling tasks
KW - heterogeneous environments
KW - pipeline communication and computation
UR - http://www.scopus.com/inward/record.url?scp=85101749775&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85101749775&partnerID=8YFLogxK
U2 - 10.1109/ACCESS.2021.3060154
DO - 10.1109/ACCESS.2021.3060154
M3 - Article
AN - SCOPUS:85101749775
SN - 2169-3536
VL - 9
SP - 44049
EP - 44058
JO - IEEE Access
JF - IEEE Access
M1 - 9356607
ER -
