
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

Thawakar, Omkar
Dissanayake, Dinura
More, Ketan Pravin
Thawkar, Ritesh
Heakl, Ahmed
Ahsan, Noor
Li, Yuhao
Zumri, Ilmuz Zaman Mohammed
Lahoud, Jean
Anwer, Rao Muhammad
and 5 more authors (listed in full under Citation)
Abstract
Step-by-step reasoning is crucial for solving complex visual tasks, yet existing approaches lack a comprehensive framework for evaluating this capability and do not emphasize stepwise problem-solving. To this end, we propose a comprehensive framework for advancing multi-step visual reasoning in large multimodal models (LMMs) through three key contributions. First, we introduce the Visual Reasoning Chain Benchmark (VRC-Bench), a benchmark for multi-step visual reasoning that covers eight diverse categories and over 4k verified reasoning steps to rigorously evaluate LLMs' ability to reason accurately and interpretably across multiple steps. Second, we propose a fine-grained visual reasoning metric that evaluates correctness and logical coherence at each step, providing deeper insights than traditional accuracy metrics. Third, we introduce LlamaV-o1, a state-of-the-art multimodal step-by-step reasoning model trained using a multi-step curriculum learning approach. LlamaV-o1 is optimized for structured step-by-step reasoning. It obtains a significant gain of around 9% averaged across six benchmarks compared to the baseline, demonstrating the impact of the proposed step-by-step visual reasoning. Further, it outperforms the recent Llava-CoT with an absolute gain of 3.8% averaged across six benchmarks, while being 5× faster during inference scaling. On VRC-Bench, LlamaV-o1 achieves the best performance among all open-source reasoning LMMs in terms of both final accuracy and reasoning steps. Our benchmark, model, and code are available at https://github.com/mbzuai-oryx/LlamaV-o1.
Citation
O. Thawakar, D. Dissanayake, K.P. More, R. Thawkar, A. Heakl, N. Ahsan, Y. Li, I.Z.M. Zumri, J. Lahoud, R.M. Anwer, H. Cholakkal, I. Laptev, M. Shah, F.S. Khan, S. Khan, "LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs," in Findings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 24290-24315.
Source
Findings of the Association for Computational Linguistics: ACL 2025
Conference
Findings of the Association for Computational Linguistics: ACL 2025
Publisher
Association for Computational Linguistics