Breaking the I/O Bottleneck: I/O Coordination Optimization for Efficient Large-Scale LLM Fine-Tuning
Shen, Ziyang ; Du, Hongchao ; Lin, Kaihuan ; Wu, Dawei ; Lin, Yin ; Luo, Ting ; Li, Qiao ; Xue, Chun Jason
Department
Computer Science
Type
Conference proceeding
Language
English
Abstract
Large Language Models (LLMs) with tens or even hundreds of billions of parameters have become the foundation of modern AI applications. However, fine-tuning such massive models is severely constrained by limited GPU memory. Existing memory-saving systems, such as ZeRO-based offloading in DeepSpeed, reduce GPU memory usage but inevitably incur substantial I/O overhead, especially when model states reside on slow storage devices, such as NVMe SSDs. As a result, the memory bottleneck in large-scale fine-tuning is transformed into an I/O bottleneck. Although prior systems have employed strategies like parameter prefetching and partial asynchronous execution, they remain limited by synchronous I/O-communication dependencies and the lack of fine-grained read/write I/O scheduling. To address these limitations, we propose IOC, an I/O Coordination Optimization framework that maximizes pipeline parallelism across different phases of LLM fine-tuning. IOC introduces three key mechanisms: (1) an All-Gather prefetching technique based on an I/O state hash table, which completely decouples All-Gather prefetching from parameter I/O and achieves continuous overlap among I/O, communication, and computation; (2) a parameter update phase refactored into an asynchronous pipeline with explicit I/O isolation, in which the optimizer state write-back is executed in a semi-asynchronous manner, thereby mitigating read/write contention and reducing synchronization stalls; and (3) multi-disk parallelism, which introduces an additional disk to further relieve I/O contention and defers synchronization waits to the latest possible time point. Experimental results demonstrate that IOC significantly accelerates LLM fine-tuning while preserving low memory consumption. The end-to-end fine-tuning time on the Llama-70B model is reduced by 21.5% and 34.3% in the single-disk and multi-disk configurations, respectively, compared to the baseline.
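The first mechanism in the abstract, an I/O state table that lets communication proceed independently of parameter reads, can be illustrated with a toy sketch. This is not the authors' implementation: the class `IOStateTable`, the functions `read_shard`, `gather_shard`, and `run_pipeline`, the two separate thread pools, and the list payloads are all hypothetical stand-ins, and the "All-Gather" here is just a local sum rather than a real collective.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class IOStateTable:
    """Hypothetical table mapping a shard id to its read state and payload."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = {}  # shard id -> threading.Event, set once the read completes
        self._data = {}   # shard id -> payload loaded from storage

    def _event(self, shard_id):
        # Create the per-shard event lazily so readers and waiters can arrive in any order.
        with self._lock:
            return self._ready.setdefault(shard_id, threading.Event())

    def mark_ready(self, shard_id, payload):
        with self._lock:
            self._data[shard_id] = payload
        self._event(shard_id).set()

    def wait(self, shard_id, timeout=None):
        if not self._event(shard_id).wait(timeout):
            raise TimeoutError(f"shard {shard_id} was not read in time")
        with self._lock:
            return self._data[shard_id]


def read_shard(table, shard_id):
    # Stand-in for an SSD read of one parameter shard (assumption: 4 elements per shard).
    table.mark_ready(shard_id, [shard_id] * 4)


def gather_shard(table, shard_id):
    # Stand-in for a collective: it starts as soon as this shard's state is ready,
    # without waiting for reads of other shards to finish.
    return sum(table.wait(shard_id, timeout=5.0))


def run_pipeline(num_shards=4):
    table = IOStateTable()
    # Separate pools keep I/O and communication from blocking each other.
    with ThreadPoolExecutor(max_workers=2) as io_pool, \
            ThreadPoolExecutor(max_workers=2) as comm_pool:
        gathers = [comm_pool.submit(gather_shard, table, i) for i in range(num_shards)]
        for i in range(num_shards):
            io_pool.submit(read_shard, table, i)
        return [f.result() for f in gathers]


if __name__ == "__main__":
    print(run_pipeline())
```

The design point the sketch tries to capture is that a consumer synchronizes on the state of one shard rather than on the I/O queue as a whole, so reads for later shards can continue while earlier shards are already being communicated.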
Citation
Z. Shen, H. Du, K. Lin, D. Wu, Y. Lin, T. Luo, et al., "Breaking the I/O Bottleneck: I/O Coordination Optimization for Efficient Large-Scale LLM Fine-Tuning," 2026, pp. 578-588.
Source
Conference
2026 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
Keywords
46 Information and Computing Sciences, 4606 Distributed Computing and Systems Software
Publisher
IEEE
