Co-adaptive Scheduling with 3D Parallelization and Gradient Compression
Abdelfattah Sayedelahl, Omar Baha Abdelmoneim
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
AI research has become more critical and promising than ever, thanks to the availability of hardware resources, especially GPUs. Despite their high cost, companies and research institutes invest heavily in these resources, since they are one of the main limitations on advancing AI research. As AI models, such as Large Language Models (LLMs), and training datasets grow larger, the need for a more efficient and faster training process becomes more crucial. Many works have aimed to improve the training process, such as distributed training, which distributes training across multiple GPUs or even multiple machines. However, the communication time among these devices is a bottleneck in the training process. One of the latest works to address this issue is L-GreCo [1], which applies layerwise gradient compression among workers in a data-parallel (DP) setting to reduce communication overhead without sacrificing model performance, since some layers affect training accuracy more than others. However, that work does not account for other forms of parallelism or for cluster scheduling settings, which are very common in research institutes. In this work, we propose a new system integrating co-adaptive goodput scheduling, as introduced in AdaptDL [2], with automatic 3D parallelism and layerwise gradient quantization. We explore the space of compression techniques, each with varying compression parameters, and scheduling strategies to improve both the training efficiency of each model and the utilization of the cluster as a whole. The system uses goodput, a metric combining the throughput and the statistical efficiency of training, to schedule jobs and co-adaptively adjust the batch size and learning rate. We find that L-GreCo is unsuitable for many hardware settings: when communication bandwidth is high enough, the savings it achieves do not justify the overhead it incurs.
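The layerwise quantization idea can be illustrated with a minimal sketch: gradients of accuracy-sensitive layers keep more bits than robust ones. This is a generic uniform stochastic quantizer, not L-GreCo's exact compression scheme, and the per-layer bit budgets below are made-up values for illustration.

```python
import numpy as np

def quantize_layer(grad, bits):
    """Uniform stochastic quantization of one layer's gradient to
    2**bits - 1 positive levels (illustrative, not L-GreCo's scheme)."""
    levels = 2 ** bits - 1
    scale = np.abs(grad).max()
    if scale == 0:
        return grad
    normalized = np.abs(grad) / scale * levels
    lower = np.floor(normalized)
    # stochastic rounding keeps the quantizer unbiased in expectation
    rounded = lower + (np.random.rand(*grad.shape) < (normalized - lower))
    return np.sign(grad) * rounded / levels * scale

# layerwise budgets: sensitive layers get more bits (hypothetical values)
per_layer_bits = {"embedding": 8, "attention": 4, "mlp": 2}
grads = {name: np.random.randn(64) for name in per_layer_bits}
compressed = {name: quantize_layer(g, per_layer_bits[name])
              for name, g in grads.items()}
```

In a DP setting, each worker would quantize its gradients before the all-reduce, trading reconstruction error for communication volume on a per-layer basis.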
We also show that naively combining gradient compression with Pollux's [2] batch size autoscaling does not deliver the improvements it promises, indicating the need for co-adaptivity between gradient compression and batch size scaling.
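The goodput objective behind the scheduling can be sketched as the product of system throughput and statistical efficiency, following Pollux [2]. The `throughput` and `efficiency` functions below are hypothetical toy models (throughput saturates with batch size; per-example statistical progress decays), standing in for the profiled and gradient-noise-scale-based estimates used in practice.

```python
def goodput(batch_size, throughput_fn, efficiency_fn):
    """Pollux-style goodput: examples/sec weighted by how much each
    example contributes to statistical progress."""
    return throughput_fn(batch_size) * efficiency_fn(batch_size)

# toy stand-ins (hypothetical parameters, not profiled values)
def throughput(m, peak=1000.0, half=256.0):
    return peak * m / (m + half)            # examples/sec, saturating

def efficiency(m, noise_scale=512.0):
    return (noise_scale + 1) / (noise_scale + m)  # relative progress/example

# co-adaptive tuning picks the batch size that maximizes goodput
best = max(range(32, 2049, 32),
           key=lambda m: goodput(m, throughput, efficiency))
```

With these toy curves, goodput rises while throughput gains dominate and falls once the efficiency loss of large batches takes over, so the maximizer sits at an intermediate batch size rather than the largest one that fits.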
Citation
Abdelfattah Sayedelahl and Omar Baha Abdelmoneim, “Co-adaptive Scheduling with 3D Parallelization and Gradient Compression,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Scheduling, Gradient Compression, 3D Parallelism, ML Systems
