Deep neural networks have become a popular technique in many research and application areas, including computer vision and natural language processing. As the complexity of neural networks continues to increase, training takes much longer and requires more computational resources. To speed up training, a centralized distributed training architecture called the Parameter Server (PS) is widely used to assign training tasks to different workers/nodes. Most existing studies assume that all workers have the same computational resources. However, in a heterogeneous environment, fast workers (i.e., workers with more computational resources) complete their tasks earlier than slow workers, and thus the system does not fully utilize the resources of the fast workers. In this paper, we propose a PS model with heterogeneous types of workers/nodes, called H-PS, which fully utilizes the resources of each worker by dynamically scheduling tasks based on the current status of the workers (e.g., available memory). As a result, the workers complete their tasks at the same time, and stragglers (i.e., workers that fall behind the others) are avoided. In addition, a pipeline scheme is proposed to further improve worker efficiency by fully utilizing worker resources while parameters are being transmitted between the PS and the workers. Moreover, a flexible quantization scheme is proposed to reduce the communication overhead between the PS and the workers. Finally, H-PS is implemented using containers, an emerging lightweight technology. The experimental results indicate that the proposed H-PS reduces the overall training time by 1.4x - 3.5x compared with existing methods.
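The scheduling idea in the abstract, giving each worker an amount of work matched to its capacity so that all workers finish together, can be illustrated with a minimal sketch. This is not the H-PS algorithm itself: the abstract mentions worker status such as available memory, whereas the sketch below assumes, purely for illustration, that each worker reports a measured throughput in samples per second. The function name `split_batch` and its parameters are hypothetical.

```python
def split_batch(total_samples, throughputs):
    """Split a global batch across workers in proportion to throughput.

    Hypothetical helper, not taken from the paper. `throughputs` holds one
    positive number per worker (assumed here to be samples per second).
    Returns a list of integer sample counts that sums to `total_samples`.
    """
    if total_samples < 0:
        raise ValueError("total_samples must be non-negative")
    if not throughputs or any(t <= 0 for t in throughputs):
        raise ValueError("throughputs must be a non-empty list of positive numbers")

    total_rate = sum(throughputs)
    # Ideal (fractional) share for each worker.
    exact = [total_samples * t / total_rate for t in throughputs]
    # Round down first, then hand out the leftover samples one by one
    # to the workers with the largest fractional remainders.
    shares = [int(e) for e in exact]
    leftover = total_samples - sum(shares)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - shares[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        shares[i] += 1
    return shares


if __name__ == "__main__":
    rates = [100.0, 200.0, 100.0]  # assumed samples/second for three workers
    shares = split_batch(256, rates)
    # Estimated time per worker is share / rate; equal values mean no straggler.
    print(shares, [s / r for s, r in zip(shares, rates)])
```

With proportional shares, the estimated per-worker time (share divided by throughput) is the same for every worker up to rounding, which is the condition under which no worker becomes a straggler. A dynamic scheduler would re-measure the rates and recompute the split between iterations.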
Bibliographical note
Funding Information:
This work was supported in part by the Special Fund for Scientific Instruments of the National Natural Science Foundation of China under Grant 61827810, and in part by the Qingdao Agricultural University High-level Talents Research Fund under Grant 1119005.
© 2013 IEEE.
- Distributed machine learning (DML)
- dynamic quantization parameter
- dynamically scheduling tasks
- heterogeneous environments
- pipeline communication and computation