LLMs Can Compensate for Deficiencies in Visual Representations
Takishita, Sho ; Gala, Jay ; Mohamed, Abdelrahman ; Inui, Kentaro ; Kementchedjhieva, Yova
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Abstract
Many vision-language models (VLMs) that prove highly effective across a range of multimodal tasks build on CLIP-based vision encoders, which are known to have various limitations. We investigate the hypothesis that the strong language backbone in VLMs compensates for possibly weak visual features by contextualizing or enriching them. Using three CLIP-based VLMs, we perform controlled self-attention ablations on a carefully designed probing task. Our findings show that despite known limitations, CLIP visual representations offer ready-to-read semantic information to the language decoder. However, in scenarios of reduced contextualization in the visual representations, the language decoder can largely compensate for the deficiency and recover performance. This suggests a dynamic division of labor in VLMs and motivates future architectures that offload more visual processing to the language decoder.
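The self-attention ablation described in the abstract can be illustrated with a minimal sketch. The following PyTorch example is hypothetical and not the paper's actual experimental code: the class ToyEncoderBlock and all dimensions are invented for illustration. It shows one way to reduce contextualization in visual representations, by restricting self-attention to a diagonal mask so that each patch token attends only to itself and no cross-patch context is mixed in.

    # Minimal sketch (not the authors' exact setup): ablating cross-token
    # contextualization in a toy vision-encoder block.
    import torch
    import torch.nn as nn

    class ToyEncoderBlock(nn.Module):
        def __init__(self, dim: int = 64, heads: int = 4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, x: torch.Tensor, ablate: bool = False) -> torch.Tensor:
            n = x.size(1)
            mask = None
            if ablate:
                # In nn.MultiheadAttention, True entries in attn_mask are
                # *blocked*. Allowing only the diagonal means each token
                # sees only itself, removing cross-token contextualization.
                mask = ~torch.eye(n, dtype=torch.bool, device=x.device)
            h = self.norm(x)
            out, _ = self.attn(h, h, h, attn_mask=mask)
            return x + out

    patches = torch.randn(1, 16, 64)        # 16 patch tokens, dim 64
    block = ToyEncoderBlock()
    full = block(patches)                   # normal contextualization
    isolated = block(patches, ablate=True)  # self-attention ablated
    print((full - isolated).abs().mean())   # nonzero: context was removed

In the paper's framing, the question is then whether a language decoder fed such decontextualized visual tokens can recover task performance, i.e., compensate for the missing cross-patch interaction.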
Citation
S. Takishita, J. Gala, A. Mohamed, K. Inui, and Y. Kementchedjhieva, "LLMs Can Compensate for Deficiencies in Visual Representations," in Findings of the Association for Computational Linguistics: EMNLP 2025, Association for Computational Linguistics, 2025, pp. 15253-15272.
Conference
Findings of the Association for Computational Linguistics: EMNLP 2025
Publisher
Association for Computational Linguistics (ACL)
