Enhance Alignment in Multimodal LLMs via Critic-based Reward Optimization
Author
Wang, Yongxin
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Multimodal large language models (MLLMs) have achieved remarkable progress on visual question answering and reasoning tasks through instruction fine-tuning on specialized datasets. In this work, we propose Enhancing Alignment in MLLMs via Critic-based Reward Optimization (EACO), a cost-efficient approach that aligns MLLMs using self-generated preference data from only 5k images. We train a critic-based evaluation model that assesses model responses across multiple dimensions, enabling the derivation of a nuanced reward signal by differentiating between preferred and non-preferred outputs. This reward signal guides refined Direct Preference Optimization (DPO) tuning, and an additional supervised fine-tuning stage further enhances model performance. EACO reduces hallucinations by 65.6% on HallusionBench and improves reasoning ability by 21.8% on MME-Cognition, achieving an 8.5% improvement over LLaVA-v1.6-Mistral-7B across multiple benchmarks.
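As context for the DPO stage mentioned in the abstract, the sketch below shows the standard direct preference optimization loss applied to critic-ranked response pairs. It is a minimal PyTorch illustration only: the function name, the beta default, and the input conventions are assumptions for exposition, not the thesis's refined variant.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Hypothetical sketch of the standard DPO objective, not the thesis's code.
    # Each input is a batch of per-response log-probabilities (summed over
    # tokens) under the trainable policy or the frozen reference model.
    # Implicit rewards: scaled log-ratios of policy to reference, computed
    # separately for critic-preferred ("chosen") and critic-rejected responses.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # DPO maximizes the margin between preferred and non-preferred responses.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

Because the reference model is frozen, only the policy's log-probabilities carry gradients, which is what makes DPO a reward-model-free alternative to PPO-style RLHF tuning.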
Citation
Yongxin Wang, “Enhance Alignment in Multimodal LLMs via Critic-based Reward Optimization,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Large Language Model (LLM), Multimodal Large Language Model, Reinforcement Learning from Human Feedback
