Deep learning is a popular technique for building models from large quantities of input data for applications in many domains. With the proliferation of edge devices such as sensors and mobile devices, large volumes of data are generated at a rapid pace all over the world. Migrating large amounts of data into centralized data centers over WAN environments is often infeasible for cost, performance, or privacy reasons. Moreover, there is an increasing need for incremental or online deep learning over newly generated data in real time. These trends require rethinking the traditional training approach to deep learning. To handle computation on distributed input data, micro-clouds (small-scale clouds deployed near edge devices in many different locations) provide an attractive alternative because they keep computation close to the data. However, existing distributed deep learning systems do not support training in micro-clouds, due to the unique characteristics and challenges of this environment. In this paper, we examine the key challenges of deep learning in micro-clouds: heterogeneity of computation and network resources at both the inter- and intra-micro-cloud levels, and the scale of these environments. We present DLion, a decentralized distributed deep learning system for such environments. It employs techniques specifically designed to address the above challenges in order to reduce training time, enhance model accuracy, and provide system scalability. We have implemented a prototype of DLion in TensorFlow, and our preliminary experiments show promising results towards achieving accurate and efficient distributed deep learning in micro-clouds.
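The abstract does not describe DLion's specific techniques, so the following is only a generic, minimal sketch of what "decentralized" training over data that stays in place can look like. It is not DLion's algorithm. Everything in it is an assumption made for illustration: the names `local_gradient` and `train_decentralized`, the ring topology, the one-parameter linear model, and the use of unequal shard sizes as a stand-in for heterogeneous sites.

```python
import random


def local_gradient(w, data):
    """Gradient of the mean squared error for the 1-D model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in data) / len(data)


def train_decentralized(shards, rounds=500, lr=0.1):
    """Hypothetical peer-to-peer training loop (illustrative, not DLion).

    Each worker holds its own copy of the weight, takes a gradient step on
    its local shard only, then averages with its ring neighbours.
    There is no central parameter server.
    """
    n = len(shards)
    weights = [0.0] * n
    for _ in range(rounds):
        # Local step: raw data never leaves the worker that holds it.
        stepped = [w - lr * local_gradient(w, d) for w, d in zip(weights, shards)]
        # Exchange step: only model parameters cross the network.
        weights = [
            (stepped[(i - 1) % n] + stepped[i] + stepped[(i + 1) % n]) / 3
            for i in range(n)
        ]
    return weights


random.seed(0)
TRUE_W = 3.0
# Unequal shard sizes are an assumed stand-in for heterogeneous sites.
shards = [
    [(x, TRUE_W * x) for x in (random.uniform(-1, 1) for _ in range(size))]
    for size in (5, 20, 50, 100)
]
weights = train_decentralized(shards)
print(weights)
```

In this toy setting every worker ends up near the same weight even though no worker sees another's data. A real system would also have to cope with the compute and network heterogeneity the abstract names, which this sketch ignores.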
| Field | Value |
|---|---|
| Original language | English (US) |
| State | Published - 2019 |
| Event | 11th USENIX Workshop on Hot Topics in Cloud Computing, HotCloud 2019 - Renton, United States |
| Duration | Jul 8 2019 → … |