TY - JOUR
T1 - Cooperative Distributed GPU Power Capping for Deep Learning Clusters
AU - Kang, Dong Ki
AU - Ha, Yun Gi
AU - Peng, Limei
AU - Youn, Chan Hyun
N1 - Publisher Copyright:
© 1982-2012 IEEE.
PY - 2022/7/1
Y1 - 2022/7/1
N2 - Recent GPU-based clusters that handle deep learning (DL) tasks are characterized by GPU device heterogeneity, a variety of deep neural network (DNN) models, and high computational complexity. Thus, traditional power capping methods designed for CPU-based clusters or small-scale GPU devices cannot be applied to GPU-based clusters handling DL tasks. This article develops a cooperative distributed GPU power capping (CD-GPC) system for GPU-based clusters, aiming to minimize the training completion time of invoked DL tasks without exceeding the limited power budget. Specifically, we first design a frequency scaling (FS) approach using online model estimation based on the recursive least squares method. This approach achieves accurate tuning of DL task training time and GPU power usage without requiring offline profiling. Then, we formulate the proposed FS problem as a Lagrangian dual decomposition-based economic model predictive control problem for large-scale heterogeneous GPU clusters. For performance evaluation, we conduct both lab-scale real experiments on NVIDIA GPUs and simulation experiments based on real job traces. Experimental results validate that the proposed system improves power capping accuracy to a mean absolute error of < 1% and reduces the deadline violation ratio of invoked DL tasks by 21.5% compared with other recent counterparts.
AB - Recent GPU-based clusters that handle deep learning (DL) tasks are characterized by GPU device heterogeneity, a variety of deep neural network (DNN) models, and high computational complexity. Thus, traditional power capping methods designed for CPU-based clusters or small-scale GPU devices cannot be applied to GPU-based clusters handling DL tasks. This article develops a cooperative distributed GPU power capping (CD-GPC) system for GPU-based clusters, aiming to minimize the training completion time of invoked DL tasks without exceeding the limited power budget. Specifically, we first design a frequency scaling (FS) approach using online model estimation based on the recursive least squares method. This approach achieves accurate tuning of DL task training time and GPU power usage without requiring offline profiling. Then, we formulate the proposed FS problem as a Lagrangian dual decomposition-based economic model predictive control problem for large-scale heterogeneous GPU clusters. For performance evaluation, we conduct both lab-scale real experiments on NVIDIA GPUs and simulation experiments based on real job traces. Experimental results validate that the proposed system improves power capping accuracy to a mean absolute error of < 1% and reduces the deadline violation ratio of invoked DL tasks by 21.5% compared with other recent counterparts.
KW - Deep learning (DL) cluster
KW - Economic model predictive control (EMPC)
KW - GPU power capping
KW - Lagrangian dual decomposition
KW - Lipschitz continuity
UR - http://www.scopus.com/inward/record.url?scp=85110848368&partnerID=8YFLogxK
U2 - 10.1109/TIE.2021.3095790
DO - 10.1109/TIE.2021.3095790
M3 - Article
AN - SCOPUS:85110848368
SN - 0278-0046
VL - 69
SP - 7244
EP - 7254
JO - IEEE Transactions on Industrial Electronics
JF - IEEE Transactions on Industrial Electronics
IS - 7
ER -