
How Good is my Video-LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs

Khattak, Muhammad Uzair
Naeem, Muhammad Ferjad
Hassan, Jameel
Naseer, Muhammad Muzammal
Tombari, Federico
Khan, Fahad Shahbaz
Khan, Salman Hameed
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Recent advancements in Large Language Models (LLMs) have led to the development of Video Large Multi-modal Models (Video-LMMs) that can handle a wide range of video understanding tasks. These models have the potential to be deployed in real-world applications such as robotics, AI assistants, medical surgery, and autonomous vehicles. The widespread adoption of Video-LMMs in our daily lives underscores the importance of ensuring and evaluating their robust performance in mirroring human-like reasoning and interaction capabilities in complex, real-world contexts. However, existing benchmarks for Video-LMMs primarily focus on general video comprehension abilities and neglect to assess both their reasoning capabilities over complex real-world videos and their robustness to user prompts posed as text queries. In this paper, we present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES), a novel benchmark that comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. We evaluate 11 recent models, including both open-source and closed-source variants, and find that most Video-LMMs, especially open-source ones, struggle with robustness and reasoning when dealing with complex videos. Based on our analysis, we develop a training-free Dual-Step Contextual Prompting (DSCP) technique to effectively enhance the performance of existing Video-LMMs on the CVRR-ES benchmark. Our findings provide valuable insights for building the next generation of human-centric AI systems with advanced robustness and reasoning capabilities. Our dataset and code will be made publicly available.
Citation
M. U. Khattak et al., “How Good is my Video-LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs,” 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3642–3651, Jun. 2025, doi: 10.1109/CVPRW67362.2025.00349
Source
IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
Conference
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2025
Keywords
Foundational model benchmarking, Multi-modal learning, Video understanding, Video-LMMs, Visual reasoning and robustness
Publisher
IEEE