Entropy-Guided Condensing for Vision Transformer
Lin, Sihao ; Lyu, Pumeng ; Liu, Dongrui ; Li, Zhihui ; Wang, Wenguan ; Chang, Xiaojun ; Zheng, Yuhui
Department
Computer Vision
Type
Journal article
Date
2026
Language
English
Abstract
Recent success of the self-attention mechanism in the vision domain underscores the need for efficient vision transformers (ViTs). This work investigates the layer-wise learning capacity of ViTs and aims to condense them along the depth dimension by removing uninformative layers, guided by transfer entropy. As an initial exploration, we inspect condensation within a transformer block. Specifically, we identify that the MLP layer can elicit entropy on par with the attention layer within a block, and that some MLPs may be underutilized, given their low entropy. We are therefore motivated to integrate non-essential attention layers into their MLP counterparts by degenerating them into identity mappings, referred to as Dilution Learning, where a sparse mask is applied to the attention layer and decays during training. Although dilution learning is verified on a series of ViT architectures, it shows instability on scale-enhanced ViTs such as DeiT-L, as the learnable scale is difficult to converge. The issue stems from the coupling of the decaying sparse mask with the unbounded learnable scale in the attention layers, which makes them difficult to optimize jointly. To mitigate this problem, we use a simplified optimization strategy that alternately optimizes the learnable scale and the sparse mask. In this way, we decouple their learning processes and stabilize the training of scale-enhanced ViTs. Additionally, our new approach extends the previous layer-wise condensation to the block-wise level, further enhancing efficiency. Our model series demonstrates superior results on a variety of vision tasks and benchmarks. For example, our method removes 50% of the attention layers or 30% of the transformer blocks of DeiT-B without compromising performance on ImageNet-1k.
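To make the two mechanisms in the abstract concrete, below is a minimal PyTorch-style sketch of a decaying sparse mask that dilutes an attention branch into an identity mapping, plus an alternating schedule that updates the learnable scale and the mask in turn. All names (DilutedBlock, decay_mask, alternate_step) and the linear decay schedule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DilutedBlock(nn.Module):
    """Transformer block whose attention branch is gated by a decaying mask."""

    def __init__(self, attn: nn.Module, mlp: nn.Module, total_steps: int):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.total_steps = total_steps
        # Per-block scalar gate on the attention branch; starts fully open.
        self.register_buffer("mask", torch.ones(()))
        # Unbounded learnable scale, as in scale-enhanced ViTs (e.g. LayerScale).
        self.scale = nn.Parameter(torch.ones(()))

    def decay_mask(self, step: int) -> None:
        # One possible schedule (an assumption): linear decay from 1 to 0.
        self.mask.fill_(max(0.0, 1.0 - step / self.total_steps))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With the residual connection, mask -> 0 turns the attention branch
        # into an identity mapping, leaving only the MLP branch at inference.
        x = x + self.mask * self.scale * self.attn(x)
        return x + self.mlp(x)


def alternate_step(block: DilutedBlock, step: int) -> None:
    # Decoupled optimization (assumed scheme): on even steps train the
    # learnable scale with the mask frozen; on odd steps decay the mask
    # with the scale frozen, so the two are never optimized jointly.
    if step % 2 == 0:
        block.scale.requires_grad_(True)
    else:
        block.scale.requires_grad_(False)
        block.decay_mask(step)
```

In the spirit of the abstract, blocks whose masks have decayed to zero could then be condensed: the attention layer folded into its MLP counterpart (layer-wise) or the whole block dropped (block-wise).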
Citation
S. Lin, P. Lyu, D. Liu, Z. Li, W. Wang, X. Chang, Y. Zheng, "Entropy-Guided Condensing for Vision Transformer," International Journal of Computer Vision, vol. 134, no. 3, pp. 86-86, 2026, https://doi.org/10.1007/s11263-026-02753-y.
Source
International Journal of Computer Vision
Keywords
46 Information and Computing Sciences, 4611 Machine Learning
Publisher
Springer Nature
