Which Cluster Meets My Deadline: A Budget-Aware Scheduler for Distributed Training Jobs in Heterogeneous Environments

Zhang, Yuchen
Luo, Long
Li, Zonghang
Sun, Gang
Yu, Hongfang
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Training deep learning (DL) models demands substantial computational resources and often relies on expensive GPUs used in a distributed manner. To meet this demand, cloud providers deploy GPU clusters worldwide and rent compute instances to users. These GPU clusters are typically heterogeneous, comprising multiple GPU types with varying computational capabilities, and their prices vary significantly across both GPU instance types and geographic regions. Meanwhile, users often have specific deadlines for training their models. Most existing works focus solely on performance, overlooking price heterogeneity and failing to optimize costs effectively. Given the high cost and time demands of model training, focusing on performance alone is insufficient. In this paper, we aim to balance performance and cost through a DL broker service that maximizes the number of jobs completed within their deadlines under a given budget. We propose CADDS, which jointly optimizes cluster placement and dynamically adjusts GPU type and quantity during training to reduce rental costs. We formulate the scheduling problem as an integer nonlinear programming problem and propose an efficient online approach that combines a greedy strategy with dynamic programming. Experimental results demonstrate that CADDS significantly outperforms existing approaches, improving the deadline satisfaction ratio within budget limits by 59.4% to 80.2%.
Citation
Y. Zhang, L. Luo, Z. Li, G. Sun and H. Yu, "Which Cluster Meets My Deadline: A Budget-Aware Scheduler for Distributed Training Jobs in Heterogeneous Environments," ICC 2025 - IEEE International Conference on Communications, Montreal, QC, Canada, 2025, pp. 1506-1511, doi: 10.1109/ICC52391.2025.11161040.
Source
Conference Record - International Conference on Communications
Conference
2025 IEEE International Conference on Communications, ICC 2025
Keywords
budget, deadlines, DL training, heterogeneous clusters
Publisher
IEEE