
Synergy: System for Co-adaptive Goodput-Based Scheduling and Hybrid Parallelism of Deep Learning Jobs

Sakip, Akhmed
Department
Machine Learning
Embargo End Date
2024-01-01
Type
Thesis
Date
2024
License
Language
English
Abstract
In the realm of deep learning (DL), the development of increasingly complex models, such as large language models (LLMs), has escalated the need for sophisticated distributed computing approaches. To train these expansive models effectively, distributed deep learning across multiple graphics processing units (GPUs) is essential, reducing both computational time and per-device memory requirements. However, a significant research challenge arises from the limitations of current scheduling systems: they often struggle to allocate resources effectively among DL jobs sharing the same cluster, are insensitive to job-specific training progress and hyperparameters, and consider only a single form of distributed DL, namely data parallelism. This gap underscores a critical need for integrated solutions that combine advanced model parallelization with dynamic, goodput-aware scheduling to support the efficient training of large-scale DL models. This work presents Synergy, a system designed to improve the training of large DL models by integrating co-adaptive goodput-based scheduling with automatic 3D parallelism. Synergy aims to enhance training efficiency by optimizing goodput, a metric that jointly captures the system throughput and the statistical efficiency of training. This approach ensures efficient use of computational resources and accelerates the convergence of DL jobs. The key contribution of Synergy is its architecture, which supports the demanding requirements of large-model training through adaptive scheduling and parallelism. Synergy is structured into two main components: SynergyTask and SynergyScheduler. SynergyTask manages hyperparameter tuning and automatic parallelism for each DL job, while SynergyScheduler allocates resources across multiple tasks, optimizing for goodput. This structure makes Synergy user-friendly, requiring no user expertise in model parallelization or in tuning hyperparameters for distributed training.
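The goodput objective the abstract describes can be illustrated with a minimal sketch. This is not Synergy's actual implementation; the throughput and statistical-efficiency models below (sublinear GPU scaling, diminishing progress per example at larger batch sizes) are simplified assumptions chosen only to show why maximizing goodput, rather than raw throughput, yields a finite optimal resource allocation.

```python
# Hedged sketch of a goodput-based allocation decision.
# All constants and model shapes are illustrative assumptions,
# not measurements or Synergy's real cost models.

def throughput(num_gpus: int, per_gpu_rate: float = 100.0,
               scaling: float = 0.9) -> float:
    """Examples/sec: each added GPU contributes slightly less
    (assumed sublinear scaling due to communication overhead)."""
    return per_gpu_rate * num_gpus * scaling ** (num_gpus - 1)

def statistical_efficiency(batch_size: int, noise_scale: float = 500.0,
                           base_batch: int = 32) -> float:
    """Training progress per example relative to a small base batch
    (assumed gradient-noise-style model: big batches waste examples)."""
    return (noise_scale + base_batch) / (noise_scale + batch_size)

def goodput(num_gpus: int, per_gpu_batch: int = 32) -> float:
    """Goodput = throughput x statistical efficiency: useful progress/sec."""
    return throughput(num_gpus) * statistical_efficiency(num_gpus * per_gpu_batch)

# A goodput-aware scheduler picks the allocation maximizing goodput,
# not the largest one: throughput keeps growing, goodput peaks.
best_gpus = max(range(1, 17), key=goodput)
```

Under these assumed curves the maximizer sits strictly inside the range: beyond some point, extra GPUs raise raw throughput less than the enlarged global batch lowers statistical efficiency, so goodput declines.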
Citation
A. Sakip, "Synergy: System for Co-adaptive Goodput-Based Scheduling and Hybrid Parallelism of Deep Learning Jobs", M.S. thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2024.