TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices
Li, Zonghang ; Feng, Wenjiao ; Guizani, Mohsen ; Yu, Hongfang
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
LLM serving is shifting from the cloud to the edge due to privacy concerns over user interaction data. However, mobile devices have very limited computing power and memory, so running LLM apps requires collaboration among multiple devices. The mainstream solution, pipeline parallelism, is inefficient in this setting because a mobile device typically runs only one inference task at a time. This paper argues that tensor parallelism, despite its higher communication cost, is a better fit for such scenarios. We introduce TPI-LLM, a compute- and memory-efficient tensor parallel inference system designed to run 70B-scale LLMs on low-resource mobile devices. TPI-LLM keeps sensitive raw data local to users' devices and employs a sliding window memory scheduler to dynamically manage layer weights, overlapping disk I/O with computation and communication so that large models can run efficiently on memory-limited devices. Extensive experiments show that TPI-LLM reduces token latency by 80%–90% compared to Transformers, Accelerate, and Galaxy, and cuts peak memory footprint by 90%, requiring just 3.1 GiB of memory for 70B-scale models. The code is openly available at https://github.com/Lizonghang/TPI-LLM.
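The abstract's sliding window memory scheduler can be illustrated with a minimal sketch: keep only a fixed-size window of layer weights resident, loading layers on demand and evicting the oldest when the window is full. The class name, method names, and the string placeholder for weights below are assumptions for illustration, not the TPI-LLM implementation.

```python
from collections import OrderedDict

class SlidingWindowScheduler:
    """Hypothetical sketch of a sliding-window layer-weight cache:
    at most `window` layers are kept in memory at once."""

    def __init__(self, num_layers, window):
        self.num_layers = num_layers
        self.window = window
        self.resident = OrderedDict()  # layer_id -> weights (placeholder)

    def load(self, layer_id):
        # Stand-in for reading layer weights from disk; in a real system
        # this load would overlap with computation on earlier layers.
        return f"weights[{layer_id}]"

    def get(self, layer_id):
        if layer_id not in self.resident:
            if len(self.resident) >= self.window:
                # Evict the least recently loaded layer to stay within budget.
                self.resident.popitem(last=False)
            self.resident[layer_id] = self.load(layer_id)
        return self.resident[layer_id]

# A forward pass visits layers in order; only the last `window` stay resident.
sched = SlidingWindowScheduler(num_layers=80, window=4)
for i in range(8):
    sched.get(i)
print(sorted(sched.resident))  # -> [4, 5, 6, 7]
```

This is why peak memory stays bounded by the window size rather than the full model size: earlier layers are dropped as later ones stream in.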
Citation
Z. Li, W. Feng, M. Guizani and H. Yu, "TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Mobile Devices," in IEEE Transactions on Services Computing, doi: 10.1109/TSC.2025.3596892
Source
IEEE Transactions on Services Computing
Publisher
IEEE
