Active Prompt Caching in Edge Networks for Generative AI and LLMs: An RL-Based Approach

Authors
Baccour, Emna
Erbad, Aiman
Mohamed, Amr
Hamdi, Mounir
Guizani, Mohsen
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Generative AI (GAI) and Large Language Models (LLMs) have revolutionized natural language processing and content creation. However, their significant computational demands during inference often require cloud servers, currently the only viable option for handling complex multi-modal models such as GPT-4. The inherent complexity of these models increases latency, posing challenges even within cloud environments. Cloud reliance brings further challenges, including high bandwidth consumption when transferring diverse data types. Worse, in personalized GAI applications such as virtual assistants, similar prompts recur frequently, causing redundant transmission and recomputation of replies and further increasing overhead. Accelerating the inference of multi-modal systems is therefore critical in artificial intelligence. In this paper, we aim to improve inference efficiency through prompt caching: if a current prompt is semantically similar to a previous one, the system can reuse the earlier response without invoking the model again. We leverage collaborative edge computing to cache popular replies and store the embeddings of their requests. New prompts are processed locally to extract embeddings, whose quality is determined by the resources available on the edge servers. We formulate the problem as an optimization that manages offloading decisions for GAI tasks, aiming to avoid cloud inference and minimize latency while maximizing reply quality. Given its non-convex nature, we propose to solve it via Block Successive Upper Bound Minimization (BSUM). Reinforcement learning is employed to actively pre-cache prompts, tackling the complexity of unknown prompt popularity. Our approach achieves near-optimal performance, significantly outperforming cloud-only solutions.
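The reuse-on-semantic-hit pattern the abstract describes can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the embed() placeholder, the 384-dimensional embeddings, the 0.9 cosine-similarity threshold, and the cloud_infer callback are all illustrative assumptions. It only shows the mechanism of serving a cached reply when a new prompt's embedding is close enough to a stored one, and falling back to cloud model inference otherwise.

```python
import numpy as np

def embed(prompt: str) -> np.ndarray:
    """Placeholder embedding function (assumption). In the paper's setting,
    a local encoder runs on the edge server, and the available edge
    resources determine the embedding quality."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.standard_normal(384)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

class SemanticPromptCache:
    """Minimal semantic cache: keeps (embedding, reply) pairs for popular
    prompts and answers a new prompt from cache when its embedding is
    sufficiently similar to a stored one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold            # cosine cutoff (assumed value)
        self.embeddings: list[np.ndarray] = []
        self.replies: list[str] = []

    def lookup(self, prompt: str):
        if not self.embeddings:
            return None
        q = embed(prompt)
        sims = np.stack(self.embeddings) @ q  # cosine sims of unit vectors
        best = int(np.argmax(sims))
        return self.replies[best] if sims[best] >= self.threshold else None

    def insert(self, prompt: str, reply: str):
        self.embeddings.append(embed(prompt))
        self.replies.append(reply)

cache = SemanticPromptCache(threshold=0.9)

def answer(prompt: str, cloud_infer) -> str:
    """Serve from the edge cache on a semantic hit; otherwise pay the
    cloud round-trip and cache the result for future similar prompts."""
    hit = cache.lookup(prompt)
    if hit is not None:
        return hit               # edge hit: no cloud inference needed
    reply = cloud_infer(prompt)  # miss: invoke the (cloud) model
    cache.insert(prompt, reply)
    return reply

# Example with a trivial stand-in for the cloud model:
print(answer("What is the weather today?", cloud_infer=lambda p: f"reply to: {p}"))
```

The similarity threshold captures the trade-off the paper's optimization balances: loosening it avoids more cloud inferences and reduces latency, at the risk of returning a less relevant cached reply, i.e., lower reply quality.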
Citation
E. Baccour, A. Erbad, A. Mohamed, M. Hamdi and M. Guizani, "Active Prompt Caching in Edge Networks for Generative AI and LLMs: An RL-Based Approach," 2025 IEEE Wireless Communications and Networking Conference (WCNC), Milan, Italy, 2025, pp. 01-07, doi: 10.1109/WCNC61545.2025.10978306.
Source
2025 IEEE Wireless Communications and Networking Conference (WCNC), 2025
Conference
IEEE Conference on Wireless Communications and Networking, 2025
Keywords
Generative AI, LLM, Collaborative Edge Computing, Prompt Caching, BSUM, RL
Publisher
IEEE