Abstract
Deep neural network (DNN) training is generally performed on cloud computing platforms. However, cloud-based training has several problems, such as network bottlenecks, server management costs, and privacy concerns. One of the most promising ways to overcome these problems is distributed DNN model training, which trains the model not only with high-performance servers but also with low-end, power-efficient mobile edge or user devices. However, due to the lack of a framework that can provide an optimal cluster configuration (i.e., determine which computing devices participate in DNN training tasks), it is difficult to perform efficient DNN model training that accounts for DNN service providers' preferences, such as training time or energy efficiency. In this paper, we introduce a novel framework for distributed DNN training that determines the best training cluster configuration from the available heterogeneous computing resources. Our proposed framework performs pre-training with a small number of training steps and estimates training time, power, energy, and energy-delay product (EDP) for each possible training cluster configuration. Based on the estimated metrics, our framework performs DNN training for the remaining steps with the cluster configuration that best matches the DNN service provider's preference. Our framework is implemented in TensorFlow and evaluated with three heterogeneous computing platforms and five widely used DNN models. According to our experimental results, in 76.67% of the cases, our framework chooses the best cluster configuration for the DNN service provider's preference with only a small training time overhead.
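The selection step the abstract describes can be illustrated with a minimal sketch. All names and numbers below are hypothetical (the paper's actual implementation is in TensorFlow and is not reproduced here): each candidate cluster configuration is pre-trained for a few steps, per-step time and power are recorded, and the configuration minimizing the provider's preferred metric (time, power, energy, or EDP) is chosen.

```python
# Hypothetical sketch of preference-based cluster selection after pre-training.
from dataclasses import dataclass

@dataclass
class ConfigMetrics:
    name: str            # cluster configuration label, e.g. "server+edge"
    step_time_s: float   # estimated time per training step (seconds)
    power_w: float       # estimated average power draw (watts)

    @property
    def energy_j(self) -> float:
        # Energy per step = average power x step time.
        return self.power_w * self.step_time_s

    @property
    def edp(self) -> float:
        # Energy-delay product = energy per step x step time.
        return self.energy_j * self.step_time_s

def pick_best(configs, preference="time"):
    """Return the configuration minimizing the preferred metric."""
    key = {
        "time":   lambda c: c.step_time_s,
        "power":  lambda c: c.power_w,
        "energy": lambda c: c.energy_j,
        "edp":    lambda c: c.edp,
    }[preference]
    return min(configs, key=key)

# Made-up metrics as if measured during a short pre-training run:
candidates = [
    ConfigMetrics("server-only", step_time_s=0.10, power_w=250.0),
    ConfigMetrics("server+edge", step_time_s=0.08, power_w=300.0),
    ConfigMetrics("edge-only",   step_time_s=0.30, power_w=40.0),
]
print(pick_best(candidates, "time").name)    # fastest configuration
print(pick_best(candidates, "energy").name)  # lowest energy per step
```

With these illustrative numbers, the time- and EDP-optimal choice differs from the energy-optimal one, which is why the framework lets the provider's preference drive the final selection.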
| Original language | English |
|---|---|
| Title of host publication | Proceedings - 2019 IEEE 25th International Conference on Parallel and Distributed Systems, ICPADS 2019 |
| Publisher | IEEE Computer Society |
| Pages | 430-437 |
| Number of pages | 8 |
| ISBN (Electronic) | 9781728125831 |
| DOIs | |
| State | Published - Dec 2019 |
| Event | 25th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2019, Tianjin, China, 4 Dec 2019 → 6 Dec 2019 |
Publication series
| Name | Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS |
|---|---|
| Volume | 2019-December |
| ISSN (Print) | 1521-9097 |
Conference
| Conference | 25th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2019 |
|---|---|
| Country/Territory | China |
| City | Tianjin |
| Period | 4/12/19 → 6/12/19 |
Keywords
- Deep neural network
- Distributed processing
- Edge computing
- Energy efficiency
- Training time