Efficient Inference of Vision Instruction-Following Models with Elastic Cache
Liu, Zuyan ; Liu, Benlin ; Wang, Jiahui ; Dong, Yuhao ; Chen, Guangyi ; Rao, Yongming ; Krishna, Ranjay ; Lu, Jiwen
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
In the field of instruction-following large vision-language models (LVLMs), efficient deployment faces challenges, notably the high memory demands of the key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, in this paper we introduce Elastic Cache, a novel approach that applies distinct acceleration methods to the instruction encoding and output generation stages. We investigate importance metrics for the different stages and propose an ‘importance-driven cache merging’ strategy to prune redundant caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points. Surrounding less important caches are then merged with these anchors, preserving contextual information in the KV caches while yielding an arbitrary acceleration ratio. For instruction encoding, we use frequency to evaluate the importance of caches. For output generation, we prioritize tokens based on their ‘distance’ with an offset, by which both the initial and most recent tokens are retained. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks. Code is available at https://github.com/liuzuyan/ElasticCache.
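The merging idea in the abstract can be sketched as follows: score every cached token, keep the top-scoring positions as anchors, and fold each remaining key/value pair into its nearest anchor by a running mean rather than discarding it. This is a minimal illustrative sketch, not the authors' implementation; the function name, the positional nearest-anchor assignment, and the uniform averaging are simplifying assumptions (the paper derives importance from the attention statistics of each stage).

```python
import numpy as np

def elastic_cache_merge(keys, values, importance, keep_ratio=0.5):
    """Importance-driven cache merging (sketch): keep the highest-scoring
    tokens as anchors and merge each remaining key/value vector into its
    nearest anchor instead of evicting it.

    keys, values: arrays of shape (seq_len, dim)
    importance:   per-token scores of shape (seq_len,) -- assumed given
    keep_ratio:   fraction of the cache to retain after merging
    """
    seq_len = keys.shape[0]
    num_anchors = max(1, int(seq_len * keep_ratio))
    # Anchor positions: the tokens with the highest importance scores.
    anchor_idx = np.sort(np.argsort(importance)[-num_anchors:])
    merged_k = keys[anchor_idx].astype(float).copy()
    merged_v = values[anchor_idx].astype(float).copy()
    counts = np.ones(num_anchors)
    for i in range(seq_len):
        if i in anchor_idx:
            continue
        # Assign to the positionally nearest anchor and update a running mean,
        # so the merged cache still reflects the surrounding context.
        j = int(np.abs(anchor_idx - i).argmin())
        counts[j] += 1
        merged_k[j] += (keys[i] - merged_k[j]) / counts[j]
        merged_v[j] += (values[i] - merged_v[j]) / counts[j]
    return merged_k, merged_v
```

With `keep_ratio=0.5`, an 8-token cache is compressed to 4 merged entries; the acceleration ratio is therefore tunable simply by choosing `keep_ratio`, which matches the "arbitrary acceleration ratio" property described above.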
Citation
Z. Liu et al., “Efficient Inference of Vision Instruction-Following Models with Elastic Cache,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) , vol. 15075 LNCS, pp. 54–69, 2025, doi: 10.1007/978-3-031-72643-9_4.
Source
Computer Vision – ECCV 2024
Keywords
Efficient Inference, Vision Instruction-Following Model
Publisher
Springer Nature
