Optimising Vision Transformer Performance on Limited Datasets: A Multi-Gradient Approach

Ali, Mohsin; Raza, Haider; Gan, John Q.; Khan, Muhammad Haris

Author

Ali, Mohsin; Raza, Haider; Gan, John Q.; Khan, Muhammad Haris

Department

Computer Vision

Type

Conference proceeding

Date

2025

Language

English

Collections

Publications

Abstract

Vision Transformers (ViTs) are well known for capturing the global context of images using Multi-head Self-Attention (MHSA). However, compared to Convolutional Neural Networks (CNNs), ViTs typically exhibit a reduced inductive bias and require a larger volume of training image data to learn local feature representations. While various methods, such as integrating CNN features or using advanced pre-training strategies, have been proposed to introduce this inductive bias, they often require significant architectural modifications or rely heavily on expansive pre-training datasets. This paper introduces a novel approach for training ViTs on limited datasets without altering the ViT architecture. We propose the Multi-Gradient Image Transformer (MGiT), which utilizes a parallel training method with a compact auxiliary ViT to adaptively optimize the weights of the target ViT. This approach yields significant performance improvements across diverse datasets and training scenarios. Our findings demonstrate that MGiT enhances ViT efficiency more effectively than traditional training methods. Furthermore, the application of Jensen-Shannon (JS) Divergence validates the convergence and alignment of feature understanding between the primary and auxiliary ViTs, thereby stabilizing the training process. The code is available at https://github.com/game-sys/Multi-Gradient-Image-Transformer-MGiT-
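
For illustration, the Jensen-Shannon divergence mentioned in the abstract can be computed between two class-probability distributions as in the minimal sketch below. This is not taken from the MGiT code: the function names and the example logits are hypothetical stand-ins for the outputs of a primary and an auxiliary ViT, and how the paper applies the measure may differ.

```python
import math


def softmax(logits):
    """Convert raw scores to a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2) in nats."""
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mix) + 0.5 * kl_divergence(q, mix)


# Hypothetical logits for one image from two models (assumed values).
primary_probs = softmax([2.0, 0.5, -1.0])
auxiliary_probs = softmax([1.5, 0.7, -0.8])

# A value near 0 means the two predicted distributions nearly agree.
print(js_divergence(primary_probs, auxiliary_probs))
```

A value that shrinks toward zero over training would indicate that the two models' predictions are becoming more similar, which is the kind of alignment the abstract describes.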

Citation

M. Ali, H. Raza, J. Q. Gan and M. H. Khan, "Optimising Vision Transformer Performance on Limited Datasets: A Multi-Gradient Approach," in 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Nashville, TN, USA, 2025, pp. 693-702, doi: 10.1109/CVPRW67362.2025.00074.

Source

IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops

Conference

2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025

Publisher

IEEE

DOI

10.1109/CVPRW67362.2025.00074

Additional links

https://www.computer.org/csdl/proceedings-article/cvprw/2025/999400a693/2a1UemNgkus
