DHVT: Dynamic Hybrid Vision Transformer for Small Dataset Recognition

Lu, Zhiying
Liu, Chuanbin
Chang, Xiaojun
Zhang, Yongdong
Xie, Hongtao
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
The performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) persists due to ViTs' lack of inductive bias, particularly when training from scratch on limited data. This paper identifies two crucial shortcomings in ViTs: weak spatial relevance and insufficiently diverse channel representation. With insufficient data, ViTs struggle to capture fine-grained spatial features and to learn robust channel representations. We propose the Dynamic Hybrid Vision Transformer (DHVT) to address these challenges. On the spatial side, DHVT introduces convolution into the feature embedding phase and the feature projection modules to strengthen spatial relevance. On the channel side, a dynamic aggregation mechanism and a novel "head token" design recalibrate and harmonize disparate channel representations. Moreover, we investigate choices of network meta-structure and adopt an optimal multi-stage hybrid structure without the conventional class token. The method is then augmented with a novel dimension-variable residual connection mechanism to fully exploit this structure. The resulting variant, DHVT2, offers a more computationally efficient solution for vision tasks. DHVT and DHVT2 achieve state-of-the-art image recognition results, effectively bridging the performance gap between CNNs and ViTs. Downstream experiments further demonstrate their strong generalization capacity. Code is available at https://github.com/ArieSeirack/DHVT.
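To illustrate the general idea of introducing convolution in the feature embedding phase (a common hybrid-ViT technique the abstract refers to), the sketch below shows a stride-p convolution acting as a patch embedding. All names, shapes, and the patch size are illustrative assumptions, not the paper's actual implementation; see the linked repository for the authors' code.

```python
import numpy as np

def conv_patch_embed(img, weight, patch=4):
    """Embed an image into tokens via a convolution with stride = patch size.

    img:    (C, H, W) input image
    weight: (D, C, patch, patch) convolution kernel
    returns (H/patch * W/patch, D) token matrix
    """
    c, h, w = img.shape
    d = weight.shape[0]
    flat_w = weight.reshape(d, -1)            # (D, C*patch*patch)
    tokens = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            # Non-overlapping patch; the strided conv is a linear
            # projection of each flattened patch.
            p = img[:, i:i + patch, j:j + patch].reshape(-1)
            tokens.append(flat_w @ p)
    return np.stack(tokens)

rng = np.random.default_rng(0)
img = rng.standard_normal((3, 8, 8))         # toy 8x8 RGB image
w = rng.standard_normal((16, 3, 4, 4))       # 16-dim embedding, 4x4 patches
tokens = conv_patch_embed(img, w)
print(tokens.shape)                           # (4, 16): 4 patches, 16 dims
```

Because stride equals kernel size, this is equivalent to the standard ViT "split into patches, then linearly project" embedding; hybrid designs typically add further (smaller-stride or depthwise) convolutions around this step to inject locality.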
Citation
Z. Lu, C. Liu, X. Chang, Y. Zhang and H. Xie, "DHVT: Dynamic Hybrid Vision Transformer for Small Dataset Recognition," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3528228.
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
Head, Transformers, Training, Computer vision, Magnetic heads, Convolution, Image recognition, Data models, Computer architecture, Convolutional neural networks
Publisher
IEEE