Overview of ImageCLEF 2025 – Multimodal Reasoning

Dimitrov, Dimitar Iliyanov
Hee, Mingshan
Xie, Zhuohan
Das, Rocktim Jyoti
Ahsan, Momina
Ahmad, Sarfraz
Paev, Nikolay
Koychev, Ivan Kolev
Nakov, Preslav Ivanov
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
We present an overview of the first edition of the ImageCLEF Multimodal Reasoning Lab at the 2025 iteration of the Conference and Labs of the Evaluation Forum (CLEF). The goal of the task is to evaluate how well vision-language models can reason over complex visual and textual examination material. The test dataset consists of 3,565 questions in 13 different languages. Participants received each question as an image, which included the answer choices, along with metadata outlining the nature of the visual content within the image. Their objective was to choose the single correct answer from three to five options. The task had moderate participation, with a total of 51 registered teams. Of these, 11 teams submitted results on the test set across all 13 languages and the multilingual leaderboard, for 129 graded submissions overall. The teams mainly used zero-shot approaches, while some opted for few-shot methods or fine-tuning. Qwen-VL was the most commonly used model, followed by Gemini. Participants focused on prompt engineering, mostly using variations of instruction prompts that guided the models through intermediate processing steps to reach a final answer. Some teams approached the task from an optimization perspective, showing that well-optimized models can achieve competitive performance with fewer parameters and faster inference times. This task contributes to the broader effort of expanding resources for vision-language reasoning evaluation, particularly in low-resource languages. The dataset has been publicly released, along with the gold labels for the test set. We hope this resource will support future research on multilingual and multimodal understanding and foster the development of better and more efficient vision-language models.
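The record does not reproduce the lab's evaluation protocol. As a rough illustration of how multiple-choice submissions of this kind are typically scored, the sketch below computes per-language accuracy and a multilingual average over JSON-lines prediction and gold files. The file names and field names ("id", "language", "answer") are assumptions made for the example, not the lab's official submission format.

# Illustrative scoring sketch (not the official evaluation code): computes
# per-language and overall accuracy for multiple-choice predictions.
# Field names "id", "language", and "answer" are assumed for illustration.
import json
from collections import defaultdict

def score(predictions_path: str, gold_path: str) -> dict:
    """Return accuracy per language plus a multilingual average."""
    # Load gold answers keyed by question id.
    gold = {}
    with open(gold_path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            gold[row["id"]] = (row["language"], row["answer"])

    # Count correct predictions per language.
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(predictions_path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            lang, answer = gold[row["id"]]
            total[lang] += 1
            correct[lang] += int(row["answer"].strip().upper() == answer.strip().upper())

    per_lang = {lang: correct[lang] / total[lang] for lang in total}
    per_lang["multilingual"] = sum(correct.values()) / sum(total.values())
    return per_lang

if __name__ == "__main__":
    print(score("predictions.jsonl", "gold.jsonl"))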
Citation
D. Dimitrov, M. S. Hee, Z. Xie, R. J. Das, M. Ahsan, S. Ahmad, N. Paev, I. Koychev, and P. Nakov, "Overview of ImageCLEF 2025 – Multimodal Reasoning," CEUR Workshop Proceedings, 2025. Accessed: Oct. 28, 2025.
Source
CEUR Workshop Proceedings
Conference
26th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF 2025
Publisher
CEUR-WS