
ContextDriӀ at ImageCLEF 2025 Multimodal Reasoning: Evaluating VLMs’ Multimodal, Multilingual and Multidomain Reasoning Capabilities via Thinking Budget Variations and Textual Augmentation

Krazheva, Vasilena T.
Markova, Diana
Dimitrov, Dimitar I.
Koychev, Ivan
Nakov, Preslav
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
With the growing capabilities of vision-language models (VLMs), current systems achieve impressive performance on tasks that require integrating vision and language, such as image captioning, simple visual question answering, and visual dialogue. However, it is often claimed that these models fall short when deeper reasoning is required. In this paper, we investigate this claim through the ImageCLEF 2025 Multimodal Reasoning task, which challenges models to solve multiple-choice questions presented as images across a range of subjects and languages. Using Gemini 2.0 Flash and Gemini 2.5 Flash, we study the effect of reasoning capacity and thinking budget, external textual transcription, and prompt design on the EXAMS-V benchmark for Bulgarian and English. Our results indicate that, contrary to expectation, VLMs can perform remarkably well on multimodal reasoning tasks in both languages. In particular, they solve Physics and Science questions with an accuracy of over 80%. We identify the thinking budget as the main contributing factor. Additionally, we demonstrate a setting in which an unconstrained thinking budget may degrade performance in Biology and Chemistry. Our submitted system ranked first on both the English and Bulgarian leaderboards, with accuracy scores of 89.65% and 90.50%, respectively.
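
The thinking budget mentioned in the abstract is exposed through the Gemini API's thinking configuration. Below is a minimal sketch (not the authors' exact pipeline) of one such query using the google-genai Python SDK; the model name, budget value, file name, and prompt wording are illustrative assumptions.

# Minimal sketch, assuming the google-genai SDK and an API key in the environment.
# Queries Gemini 2.5 Flash on an exam-style question image with an explicit
# thinking budget; file name, budget value, and prompt wording are hypothetical.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

with open("question.png", "rb") as f:  # hypothetical exam question image
    image_bytes = f.read()

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
        "Solve the multiple-choice question shown in the image and reply "
        "with a single option letter (A, B, C, or D).",
    ],
    config=types.GenerateContentConfig(
        # thinking_budget caps the tokens spent on internal reasoning:
        # 0 disables thinking, -1 leaves the budget unconstrained (dynamic).
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)

Varying thinking_budget (e.g., 0 versus -1 for dynamic thinking) corresponds to the kind of budget ablation described in the abstract; an external textual transcription of the image could be appended to the prompt as the textual augmentation.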
Citation
V. T. Krazheva, D. Markova, D. I. Dimitrov, I. Koychev, and P. Nakov, “ContextDriӀ at ImageCLEF 2025 Multimodal Reasoning: Evaluating VLMs’ Multimodal, Multilingual and Multidomain Reasoning Capabilities via Thinking Budget Variations and Textual Augmentation,” CEUR Workshop Proceedings, 2025
Source
CEUR Workshop Proceedings
Conference
26th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2025
Keywords
Gemini, Multimodal Reasoning, Optical Character Recognition, Vision-Language Model, Visual Question Answering
Publisher
CEUR-WS