SmartCLIP: Modular Vision-language Alignment with Identification Guarantees
Xie, Shaoan ; Kong, Lingjing ; Zheng, Yujia ; Yao, Yu ; Tang, Zeyu ; Xing, Eric P. ; Chen, Guangyi ; Zhang, Kun
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has emerged as a pivotal model in computer vision and multimodal learning, achieving state-of-the-art performance in aligning visual and textual representations through contrastive learning. However, CLIP struggles with potential information misalignment in many image-text datasets and suffers from entangled representations. On the one hand, short captions for a single image in datasets like MSCOCO may describe disjoint regions of the image, leaving the model uncertain about which visual features to retain or disregard. On the other hand, directly aligning long captions with images can lead to the retention of entangled details, preventing the model from learning disentangled, atomic concepts -- ultimately limiting its generalization on certain downstream tasks involving short prompts. In this paper, we establish theoretical conditions that enable flexible alignment between textual and visual representations across varying levels of granularity. Specifically, our framework ensures that a model can not only preserve cross-modal semantic information in its entirety but also disentangle visual representations to capture fine-grained textual concepts. Building on this foundation, we introduce SmartCLIP, a novel approach that identifies and aligns the most relevant visual and textual representations in a modular manner. Superior performance across various tasks demonstrates its capability to handle information misalignment and supports our identification theory. The code is available at https://github.com/Mid-Push/SmartCLIP.
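For context, the contrastive alignment objective the abstract builds on is CLIP's symmetric InfoNCE loss, which pulls matched image-text pairs together and pushes mismatched pairs apart within a batch. The sketch below is an illustrative NumPy implementation of that standard objective, not SmartCLIP's modular variant; the function name and `temperature` default are our own choices.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    image_emb, text_emb: (N, d) arrays; row i of each is a matched pair.
    Returns the average of the image-to-text and text-to-image losses.
    """
    # L2-normalize so dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; entry (i, j) compares image i with text j.
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Diagonal entries are the positive (matched) pairs.
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

When the two embedding batches are identical, every positive pair has maximal similarity and the loss approaches zero; randomly paired embeddings yield a higher loss, which is what the contrastive gradient exploits.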
Citation
S. Xie et al., “SmartCLIP: Modular Vision-language Alignment with Identification Guarantees,” 2025. Accessed: Jun. 24, 2025. [Online]. Available: https://github.com/Mid-Push/SmartCLIP
Source
Proceedings of the Computer Vision and Pattern Recognition Conference
Conference
Computer Vision and Pattern Recognition Conference (CVPR), 2025
Publisher
Computer Vision Foundation
