Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score
Ali, Eman ; Silva, Sathira ; Arora, Chetan ; Khan, Muhammad Haris
Department
Computer Vision
Type
Workshop
Language
English
Abstract
Vision-language models (VLMs) like CLIP excel in zero-shot learning by aligning image and text representations through contrastive pretraining. Existing approaches to unsupervised adaptation (UA) for fine-grained classification with VLMs rely either on fixed alignment scores, which may not capture evolving, subtle class distinctions, or on computationally expensive pseudo-labeling strategies that limit scalability. In contrast, we show that modeling fine-grained cross-modal interactions during adaptation produces more accurate, class-discriminative pseudo-labels and substantially improves performance over state-of-the-art (SOTA) methods. We introduce Fine-grained Alignment and Interaction Refinement (FAIR), an approach that dynamically aligns localized image features with descriptive language embeddings through a set of Class Description Anchors (CDA). This enables the definition of a Learned Alignment Score (LAS), which incorporates CDA as an adaptive classifier, facilitating cross-modal interactions to improve self-training in unsupervised adaptation. Furthermore, we propose a self-training weighting mechanism designed to refine pseudo-labels in the presence of inter-class ambiguities. FAIR achieves an overall gain of 2.78% across 13 fine-grained datasets compared to SOTA methods.
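For readers unfamiliar with the terminology, the following is a minimal sketch of the fixed alignment score that the abstract contrasts FAIR against: pseudo-labels taken from the cosine similarity between image embeddings and one text embedding per class. It is not the FAIR method, LAS, or CDA, none of which are specified in this record. The function names, the temperature value, and the random stand-in embeddings are all assumptions made for illustration; a real pipeline would obtain embeddings from a CLIP image and text encoder.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_pseudo_labels(image_emb, text_emb, temperature=0.01):
    """Assign each image the class whose text embedding it is most similar to.

    image_emb: (num_images, dim) array of image embeddings.
    text_emb:  (num_classes, dim) array, one embedding per class prompt.
    temperature: illustrative softmax temperature (an assumption).
    Returns (labels, confidence), each of shape (num_images,).
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1)


# Hypothetical stand-in data: 3 class embeddings and 4 images that are
# slightly perturbed copies of classes 2, 0, 1, 2.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 8))
image_emb = text_emb[[2, 0, 1, 2]] + 0.05 * rng.normal(size=(4, 8))

labels, confidence = zero_shot_pseudo_labels(image_emb, text_emb)
print(labels, confidence)
```

Because the text embeddings stay fixed here, the score cannot change as the model adapts, which is the limitation the abstract attributes to fixed alignment scores.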
Citation
E. Ali, S. Silva, C. Arora, and M. H. Khan, "Towards Fine-Grained Adaptation of CLIP via a Self-Trained Alignment Score," in 2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 5875-5885.
Source
2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Conference
IEEE Workshop on Applications of Computer Vision (WACV)
Keywords
46 Information and Computing Sciences, 4611 Machine Learning
Publisher
Institute of Electrical and Electronics Engineers
