Memory-Optimized Offloading: Enabling LLM Fine-Tuning on Memory-Constrained Hardware
Lin, Kaihuan ; Du, Hongchao ; Wu, Dawei ; Lin, Yin ; Luo, Ting ; Li, Qiao ; Xue, Chun Jason
Department
Computer Science
Type
Conference proceeding
Abstract
Large language models (LLMs) have revolutionized natural language processing, enabling breakthroughs in applications ranging from machine translation to conversational artificial intelligence. However, fine-tuning these massive models presents significant memory challenges, especially when mixed precision (FP16/FP32) is used to maintain efficiency and accuracy. Although existing techniques, such as the DeepSpeed fine-tuning framework, can reduce host memory overhead using group fine-tuning, they suffer from inefficient overflow detection and redundant gradient storage, which greatly increase memory requirements. These drawbacks limit the fine-tuning of large models on memory-constrained devices. To address these challenges, this paper proposes an instant gradient offloading strategy, a fine-tuning approach that significantly reduces memory overhead while maintaining fine-tuning accuracy and speed. Our solution introduces three key innovations: (1) On-the-fly overflow detection: each gradient is checked as soon as it is computed rather than in a batch, eliminating peak memory spikes; (2) Instant gradient offloading: the gradient is offloaded to SSD and loaded only when needed, minimizing CPU memory usage; (3) FP16-centric gradient offloading: the type conversion of the gradient data is delayed so that data is exchanged with the SSD in FP16 format. Experiments show that the proposed design reduces memory consumption by 77.8% to 85.3% compared to the DeepSpeed offload baseline while maintaining a comparable fine-tuning speed. Specifically, the proposed system can fine-tune a 98B-parameter model using 128 GB of CPU memory. These advances make large-scale LLM fine-tuning possible on resource-constrained hardware.
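The three ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use the DeepSpeed API: the function names, the file layout, and the use of a temporary directory in place of an SSD are all assumptions made for illustration, and plain Python lists stand in for gradient tensors.

```python
import math
import os
import struct
import tempfile


def has_overflow(grad):
    """Return True if any value in this one gradient is inf or NaN."""
    return any(not math.isfinite(g) for g in grad)


def offload_fp16(grad, path):
    """Write one gradient to storage as packed half-precision (FP16) values."""
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{len(grad)}e", *grad))


def load_and_upcast(path):
    """Read FP16 values back from storage.

    The conversion to a wider type happens only here, at load time.
    Python floats stand in for the FP32 copy in this sketch.
    """
    with open(path, "rb") as f:
        raw = f.read()
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))


def backward_with_instant_offload(grads_by_layer, offload_dir):
    """Check each gradient as it arrives, then offload it immediately.

    Hypothetical driver: returns (overflowed, paths). On overflow the
    step is abandoned without holding the remaining gradients in memory.
    """
    paths = []
    for i, grad in enumerate(grads_by_layer):
        if has_overflow(grad):  # on-the-fly check, no batch pass
            return True, paths
        path = os.path.join(offload_dir, f"grad_{i}.fp16")
        offload_fp16(grad, path)  # instant offload, still in FP16
        paths.append(path)
    return False, paths


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        overflowed, paths = backward_with_instant_offload(
            [[0.5, -1.25], [2.0, 0.0]], d
        )
        print(overflowed, [load_and_upcast(p) for p in paths])
```

Because each gradient is checked and written out as soon as it is produced, at most one gradient needs to be resident at a time in this sketch, and storing 2 bytes per value instead of 4 halves the data exchanged with storage.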
Citation
K. Lin, H. Du, D. Wu, Y. Lin, T. Luo, Q. Li, et al., "Memory-Optimized Offloading: Enabling LLM Fine-Tuning on Memory-Constrained Hardware," 2026, pp. 36-41.
Source
2025 IEEE 14th Non-Volatile Memory Systems and Applications Symposium (NVMSA)
Conference
IEEE 14th Non-Volatile Memory Systems and Applications Symposium (NVMSA)
Keywords
46 Information and Computing Sciences, 4605 Data Management and Data Science
Publisher
IEEE
