Sensitivity Analysis of Vision Large Language Models (VLLMs)
Author: Ismithdeen, Mohamed Insaf
Department: Computer Vision
Embargo End Date: 2025-05-30
Type: Thesis
Date: 2025
Language: English
Collections: Research Projects
Abstract
Recent advancements in Vision Large Language Models (VLLMs) have led to their increasing adoption in a variety of real-world applications, where they are primarily engaged through natural language prompts. These models, which combine visual understanding with powerful language capabilities, span both open-source and proprietary variants, each designed with different training objectives, capabilities, and alignment strategies. Despite their popularity, there remains a gap in understanding how to effectively prompt these models, especially in the context of multiple-choice question answering (MCQA) tasks used for evaluation. In this work, we investigate the sensitivity of VLLMs to prompt variation across image- and video-based MCQA benchmarks. We evaluate 10 VLLMs, ranging from lightweight open-source models to proprietary models like GPT-4o and Gemini 1.5 Pro, on three widely used benchmarks: MMStar, MMMU-Pro, and MVBench. Using a controlled suite of 61 prompts categorized into 15 types and 6 broader supercategories, we examine the performance fluctuations caused solely by prompt wording. Our results reveal that proprietary models, while generally achieving higher accuracy, exhibit greater sensitivity to prompt phrasing, likely due to their stronger alignment with instruction semantics. In contrast, open-source models show lower prompt sensitivity but often fail to leverage nuanced or indirect instructions. Through model-, dataset-, and prompt-level analyses, we highlight critical failure cases and propose a set of prompting principles tailored separately for open-source and proprietary models. Our findings provide actionable insights into prompt robustness and inform best practices for reliable and reproducible evaluation of VLLMs in MCQA settings.
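The kind of prompt-sensitivity measurement described above can be sketched as follows. This is a minimal illustration, not the thesis's actual harness: the `ITEMS`, `TEMPLATES`, and `mock_model` below are hypothetical stand-ins for a real MCQA benchmark and a real VLLM API call, and the "sensitivity" here is simply the accuracy spread across prompt phrasings of the same task.

```python
from statistics import mean

# Hypothetical MCQA items: question, lettered options, gold answer letter.
# A real benchmark item would also carry an image or video.
ITEMS = [
    {"q": "What color is the sky?", "opts": {"A": "Blue", "B": "Green"}, "gold": "A"},
    {"q": "2 + 2 = ?", "opts": {"A": "3", "B": "4"}, "gold": "B"},
]

# Two of many possible phrasings of the same instruction.
TEMPLATES = [
    "Answer with the letter of the correct option.\n{q}\n{opts}",
    "Which option is correct? Reply with a single letter.\n{q}\n{opts}",
]

def render(template, item):
    opts = "\n".join(f"{k}. {v}" for k, v in item["opts"].items())
    return template.format(q=item["q"], opts=opts)

def mock_model(prompt):
    # Toy stand-in for a VLLM call: it answers correctly under the direct
    # phrasing but degrades under the indirect one, mimicking the failure
    # mode the abstract describes.
    if prompt.startswith("Which"):
        return "A"
    return "A" if "sky" in prompt else "B"

def accuracy(template):
    correct = sum(mock_model(render(template, it)) == it["gold"] for it in ITEMS)
    return correct / len(ITEMS)

accs = [accuracy(t) for t in TEMPLATES]
sensitivity = max(accs) - min(accs)  # spread attributable to wording alone
print(f"mean accuracy: {mean(accs):.2f}, prompt sensitivity: {sensitivity:.2f}")
```

With a real model, `mock_model` would be replaced by an API call that receives the rendered prompt together with the visual input, and the per-template accuracies would be aggregated per prompt type and supercategory.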
Citation: Mohamed Insaf Ismithdeen, “Sensitivity Analysis of Vision Large Language Models (VLLMs),” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords: Large Multimodal Models, Prompt Engineering, Multimodal, Visual Question Answering, Prompting Principles
