AlignCLIP: navigating the misalignments for robust vision-language generalization
Han, Zhongyi ; Luo, Gongxu ; Sun, Hao ; Li, Yaqian ; Han, Bo ; Gong, Mingming ; Zhang, Kun ; Liu, Tongliang
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
In the realm of Vision-Language Pretraining models, achieving robust and adaptive representations is a cornerstone for successfully handling the unpredictability of real-world scenarios. This paper delves into two pivotal misalignment challenges inherent to Contrastive Language-Image Pre-training (CLIP) models: attention misalignment, which leads to an overemphasis on background elements rather than salient objects, and predictive category misalignment, characterized by the model's difficulty distinguishing between semantically similar classes. These misalignments undermine the representational stability essential for dynamic, real-world applications. To address these challenges, we propose AlignCLIP, an advanced fine-tuning methodology distinguished by its attention alignment loss, designed to calibrate the distribution of attention across multi-head attention layers. Furthermore, AlignCLIP introduces semantic label smoothing, a technique that leverages textual class similarities to refine prediction hierarchies. Through comprehensive experimentation on a variety of datasets and in scenarios involving distribution shifts and unseen classes, we demonstrate that AlignCLIP significantly enhances the stability of representations and shows superior generalization capabilities.
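The abstract describes semantic label smoothing only at a high level, so the sketch below is an illustrative interpretation rather than the paper's exact formulation: it builds soft targets from cosine similarities between class text embeddings (e.g., CLIP prompt embeddings) and mixes them with the one-hot label. The function names and the alpha and temperature parameters are assumptions introduced for illustration.

    import torch
    import torch.nn.functional as F

    def semantic_soft_targets(text_embeds, labels, alpha=0.1, temperature=0.07):
        # Illustrative sketch, not the paper's exact method.
        # text_embeds: (C, D) L2-normalized class text embeddings (e.g., CLIP prompt embeddings).
        # labels: (B,) integer class indices.
        # alpha: probability mass redistributed from the true class to similar classes.
        sim = text_embeds @ text_embeds.t()                       # (C, C) cosine similarities
        mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
        sim = sim.masked_fill(mask, float('-inf'))                # exclude self-similarity
        neighbor_dist = F.softmax(sim / temperature, dim=-1)      # spread mass by text similarity
        one_hot = F.one_hot(labels, num_classes=text_embeds.size(0)).float()
        return (1 - alpha) * one_hot + alpha * neighbor_dist[labels]

    def semantic_label_smoothing_loss(logits, labels, text_embeds, alpha=0.1):
        # Cross-entropy against similarity-aware soft targets.
        targets = semantic_soft_targets(text_embeds, labels, alpha)
        log_probs = F.log_softmax(logits, dim=-1)
        return -(targets * log_probs).sum(dim=-1).mean()

    if __name__ == "__main__":
        # Random placeholder embeddings stand in for CLIP text-encoder outputs.
        C, D, B = 10, 512, 4
        text_embeds = F.normalize(torch.randn(C, D), dim=-1)
        logits = torch.randn(B, C)
        labels = torch.randint(0, C, (B,))
        print(semantic_label_smoothing_loss(logits, labels, text_embeds).item())

Under this reading, classes whose textual descriptions are close in the embedding space receive a share of the target probability, softening penalties for confusions among semantically related categories.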
Citation
Z. Han et al., "AlignCLIP: navigating the misalignments for robust vision-language generalization," Machine Learning, vol. 114, no. 3, Art. no. 58, 2025, doi: 10.1007/s10994-025-06742-z.
Source
Machine Learning
Keywords
Vision-language pretraining, Attention alignment, Semantic label smoothing, Domain generalization, Class representation stability
Publisher
Springer Nature
