
MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Zuhri, Zayd Muhammad Kawakibi
Adilazuarda, Muhammad Farid
Purwarianti, Ayu
Aji, Alham Fikri
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
License
http://creativecommons.org/licenses/by/4.0/
Abstract
Auto-regressive inference in transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow at scale. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6x relative to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale.
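The abstract's central mechanism, sharing KV heads across layers rather than only within a layer, can be sketched in a few lines. The following is a minimal single-head PyTorch illustration under assumed details (the class name, the grouping of consecutive layers into KV groups, and the omission of causal masking, normalization, and feed-forward blocks are all assumptions for clarity); it is a sketch of how a per-group KV cache replaces a per-layer one, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MLKVBlockStack(nn.Module):
    """Toy decoder stack where groups of layers share one set of KV heads."""

    def __init__(self, n_layers: int = 12, d_model: int = 64, layers_per_kv: int = 6):
        super().__init__()
        assert n_layers % layers_per_kv == 0, "layers must divide evenly into KV groups"
        self.layers_per_kv = layers_per_kv
        # Every layer keeps its own query projection, as in a standard decoder.
        self.q_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))
        # Only one K and one V projection per group of `layers_per_kv` layers, so
        # the KV cache holds n_layers / layers_per_kv entries instead of n_layers.
        n_kv_groups = n_layers // layers_per_kv
        self.k_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_kv_groups))
        self.v_projs = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_kv_groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); masking, layer norm, and feed-forward
        # blocks are omitted to keep the KV-sharing pattern visible.
        kv_cache = {}
        for i, q_proj in enumerate(self.q_projs):
            group = i // self.layers_per_kv
            if group not in kv_cache:
                # K/V are computed once, at the first layer of the group,
                # then reused from the cache by all later layers in the group.
                kv_cache[group] = (self.k_projs[group](x), self.v_projs[group](x))
            k, v = kv_cache[group]
            q = q_proj(x)
            scores = q @ k.transpose(-2, -1) / k.size(-1) ** 0.5
            x = x + torch.softmax(scores, dim=-1) @ v
        return x

stack = MLKVBlockStack()
out = stack(torch.randn(2, 16, 64))
print(out.shape)  # torch.Size([2, 16, 64])
```

In this toy configuration, 12 layers share 2 KV entries instead of 12, a 6x cache reduction relative to one KV head per layer (the MQA baseline), mirroring the factor reported in the abstract.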
Citation
Z.M.K. Zuhri, M.F. Adilazuarda, A. Purwarianti, and A.F. Aji, "MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding," in Findings of the Association for Computational Linguistics: NAACL 2025, 2025, pp. 5516-5525.
Source
Findings of the Association for Computational Linguistics: NAACL 2025
Conference
Findings of the Association for Computational Linguistics: NAACL 2025
Publisher
Association for Computational Linguistics