MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zuhri, Zayd Muhammad Kawakibi ; Adilazuarda, Muhammad Farid ; Purwarianti, Ayu ; Aji, Alham Fikri
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
License
http://creativecommons.org/licenses/by/4.0/
Language
English
Abstract
Auto-regressive inference in transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale.
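To illustrate the idea described in the abstract, the sketch below shows cross-layer KV sharing in the spirit of MLKV: a group of consecutive decoder layers reuses the K/V projections computed by the first layer in the group, so the cache stores entries for only one layer per group rather than one per layer. This is a minimal, hypothetical PyTorch sketch based on the abstract alone; the module names, the `layers_per_kv_group` parameter, the multi-head KV choice, and the overall structure are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLKVAttentionStack(nn.Module):
    """Toy decoder attention stack with cross-layer KV sharing (MLKV-style sketch).

    Every `layers_per_kv_group` consecutive layers share one set of K/V
    projections, so a KV cache would hold n_layers / layers_per_kv_group
    entries instead of n_layers. Illustrative only, not the paper's code.
    """

    def __init__(self, n_layers=12, d_model=64, n_heads=4, layers_per_kv_group=3):
        super().__init__()
        assert n_layers % layers_per_kv_group == 0
        self.n_layers = n_layers
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.group = layers_per_kv_group
        # Every layer keeps its own query/output projections ...
        self.q_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        self.o_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_layers)])
        # ... but K/V projections exist only once per group of layers.
        n_kv = n_layers // layers_per_kv_group
        self.k_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_kv)])
        self.v_proj = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(n_kv)])

    def forward(self, x):
        # x: (batch, seq, d_model). In real decoding the cache would persist
        # across generation steps; here it only lives for one forward pass.
        b, t, _ = x.shape
        kv_cache = {}  # group index -> (k, v): 1/group the size of a per-layer cache
        for layer in range(self.n_layers):
            g = layer // self.group
            q = self.q_proj[layer](x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
            if g not in kv_cache:
                # The first layer of each group computes and caches K/V ...
                k = self.k_proj[g](x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                v = self.v_proj[g](x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                kv_cache[g] = (k, v)
            # ... and the remaining layers in the group reuse the cached tensors.
            k, v = kv_cache[g]
            attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
            attn = attn.transpose(1, 2).reshape(b, t, -1)
            x = x + self.o_proj[layer](attn)
        return x

if __name__ == "__main__":
    stack = MLKVAttentionStack()
    out = stack(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])
```

With `layers_per_kv_group=3`, only 4 of the 12 layers own K/V projections and cache entries, which is the kind of layer-wise reduction the abstract refers to; MQA/GQA instead reduce the number of KV heads within each layer.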
Citation
Z.M.K. Zuhri, M.F. Adilazuarda, A. Purwarianti, A.F. Aji, "MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding," in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 5516-5525.
Source
Findings of the Association for Computational Linguistics: NAACL 2025
Conference
Findings of the Association for Computational Linguistics: NAACL 2025
Publisher
Association for Computational Linguistics
