
Unsupervised Adaptation of Vision-Language Models

Ali, Eman Gouda Abdelmoaty
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Dissertation
Date
2025
Language
English
Abstract
Recent advances in large-scale pre-trained vision-language models, such as CLIP, have significantly improved image-text relationship modeling. By jointly training visual and textual encoders on web-scale image-text data, CLIP excels at a wide range of zero-shot image classification tasks. However, its performance is hindered by the distribution gap between the pre-training data and target-domain images. While a few labeled samples per class can help adapt CLIP, the labeling effort limits scalability, particularly for datasets with a large number of classes. Unsupervised learning, by leveraging CLIP's zero-shot capability to generate pseudo-labels, enables adaptation using unlabeled samples. Despite this, several challenges persist, including limited unlabeled target samples, domain gaps that lead to inaccurate pseudo-labels, and difficulties in fine-grained classification, where subtle class differences result in noisy pseudo-labels.

This study addresses these challenges by proposing four approaches to enhance the unsupervised adaptation of CLIP. First, we introduce the Noise-tolerant Unsupervised Adapter (NtUA), which enables robust CLIP adaptation with few unlabeled target samples. NtUA employs a weighted key-value cache and integrates a noise-rectification technique that refines pseudo-labels using CLIP-distilled knowledge; this iterative update process enhances adaptation, making NtUA effective for downstream classification tasks with limited target data. Second, we introduce Dual Prototypes Alignment (DPA) to improve CLIP's adaptation when unlabeled target data come from different domains. DPA addresses visual-textual misalignment by using dual prototypes as classifiers and combining their outputs for more accurate pseudo-labels; by aligning textual and image prototypes, DPA further mitigates misalignment and improves adaptation performance. Third, we introduce Adaptive Pseudo Labeling via Prototype Consistency and Neighborhood Awareness (ALPHA), a method designed to refine the noisy pseudo-labels that CLIP generates under domain gaps. ALPHA combines two components: PICS, which filters out noisy pseudo-labels by estimating pseudo-label accuracy from an image embedding's similarity to its class prototype and its separation from cross-class samples, and NALR, which refines pseudo-labels by leveraging semantic consistency among neighboring samples. Finally, we introduce Fine-grained Alignment and Interaction Refinement (FAIR) to improve CLIP's performance on fine-grained classification during unsupervised adaptation. FAIR enhances feature discrimination by aligning localized image features with language embeddings, strengthening cross-modal interactions for better self-training.

Experiments with the four proposed methods demonstrate their effectiveness in overcoming CLIP's limitations in unsupervised adaptation, advancing the state of the art. Beyond improving robustness to label scarcity, pseudo-label noise, and fine-grained classification, our approach provides a scalable framework for adapting vision-language models across diverse domains, supporting future research in the unsupervised adaptation of CLIP.
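To make the dual-prototype idea from the abstract concrete, the sketch below shows the general pattern: text prototypes built from class-prompt embeddings give zero-shot pseudo-labels, image prototypes are then formed as confidence-weighted means of the target-image embeddings assigned to each class, and the two classifiers' outputs are fused to refine the labels. This is a minimal illustration assuming precomputed CLIP embeddings, not the dissertation's actual DPA implementation; the function name and the alpha and tau hyperparameters are illustrative choices.

```python
import torch
import torch.nn.functional as F

def dual_prototype_pseudo_labels(image_feats, text_protos, alpha=0.5, tau=0.01):
    """Fuse text- and image-prototype classifiers to refine CLIP pseudo-labels.

    image_feats: (N, D) embeddings of unlabeled target images (CLIP image encoder).
    text_protos: (C, D) class embeddings from class-name prompts (CLIP text encoder).
    alpha:       fusion weight between the two classifiers (assumed hyperparameter).
    tau:         softmax temperature for the cosine-similarity logits.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_protos = F.normalize(text_protos, dim=-1)

    # Step 1: zero-shot pseudo-labels from the text-prototype classifier.
    text_probs = (image_feats @ text_protos.t() / tau).softmax(dim=-1)  # (N, C)
    conf, pseudo = text_probs.max(dim=-1)  # per-sample confidence and hard label

    # Step 2: image prototypes as confidence-weighted means of the embeddings
    # assigned to each class; classes with no samples keep the text prototype.
    num_classes = text_protos.shape[0]
    img_protos = text_protos.clone()
    for c in range(num_classes):
        mask = pseudo == c
        if mask.any():
            w = conf[mask].unsqueeze(-1)                        # (n_c, 1)
            img_protos[c] = (w * image_feats[mask]).sum(0) / w.sum()
    img_protos = F.normalize(img_protos, dim=-1)

    # Step 3: fuse both classifiers' outputs to obtain refined pseudo-labels.
    img_probs = (image_feats @ img_protos.t() / tau).softmax(dim=-1)
    fused = alpha * text_probs + (1 - alpha) * img_probs
    return fused.argmax(dim=-1), fused

# Toy usage with random tensors standing in for CLIP features.
feats = torch.randn(128, 512)   # 128 unlabeled images in a 512-dim CLIP space
protos = torch.randn(10, 512)   # 10 class-prompt embeddings
labels, probs = dual_prototype_pseudo_labels(feats, protos)
```

In a full self-training loop, the image prototypes would be re-estimated as the pseudo-labels improve across iterations, which is the kind of iterative refinement the abstract describes for NtUA and DPA.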
Citation
Eman Gouda Abdelmoaty Ali, “Unsupervised Adaptation of Vision-Language Models,” Doctor of Philosophy thesis, Computer Vision, MBZUAI, 2025.
Keywords
Unsupervised Adaptation, Vision-Language Models, Noise-Robust Learning