Rethink and Rearrange: Iterative Multimodal Reasoning for Enhanced Medical QA via Multimodal Retrieval-Augmented Generation
Author: Wang, Jinhong
Department: Computer Vision
Embargo End Date: 2025-05-30
Type: Thesis
Date: 2025
Language: English
Abstract
Since late 2022, Large Language Models (LLMs) have showcased exceptional capabilities in natural language processing, conversational fluency, and contextual understanding, fundamentally transforming industrial production and everyday life. Recent developments in Multimodal Large Language Models (MLLMs) have markedly improved accuracy in multimodal comprehension and professional content generation. The integration of advanced visual alignment modules with robust foundational language models has substantially enhanced MLLMs' performance in cross-modal understanding and reasoning, enabling more seamless interaction between text, images, and other data types. Concurrently, deep thinking models leveraging Thought Preference Optimization (TPO) enable self-verification during reasoning; by incorporating techniques such as Chain-of-Thought (CoT) prompting and self-reinforcement, these models achieve greater accuracy through rigorous logical validation, setting a new standard for reliable outputs.

A key challenge in developing MLLMs for medical imaging lies in integrating heterogeneous multimodal data and in the need for precise, domain-specific interpretation. While contending with issues such as noise and variability in imaging quality, traditional models often overfit to limited datasets or fail to interpret medical queries correctly.

In this work, we propose RTRA-MRAG, a compact yet powerful MLLM inspired by deep thinking paradigms and designed with strong multimodal fusion capabilities. The Rethinking and Rearrangement (RTRA) pipeline refines output embeddings by assessing the accuracy of the generated text, using Multimodal Retrieval-Augmented Generation (RAG) information to iteratively refine responses over multiple reasoning cycles. The RAG component is grounded in both online and prebuilt offline image embeddings as well as a medical knowledge base, delivering highly reliable answers informed by multimodal data. RTRA-MRAG overcomes the hurdles above through adaptive multimodal fusion and an iterative rethinking process, enabling the use of rich but complex data sources. The RTRA pipeline further improves performance by driving the model to produce and follow an explicit chain of thought, yielding more accurate reasoning in challenging medical contexts.

In experimental evaluations, RTRA-MRAG was benchmarked against mainstream medical vision LLMs (VLLMs) and long-context reasoning LLMs. Our model demonstrated a significant improvement in accuracy, outperforming similarly sized LLMs and establishing its efficacy in multimodal medical applications. The value of RTRA-MRAG is discussed further in the Extension Research section, which points to its potential for additional performance gains.
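To make the described pipeline concrete, the sketch below shows one plausible reading of the iterative retrieve, reason, verify, rearrange cycle from the abstract. It is a minimal illustration, not the thesis's actual implementation: every type and function name (RAGContext, retrieve_multimodal_context, generate_with_cot, verify_answer, rearrange_context), as well as the cycle count and verification threshold, is a hypothetical placeholder invented for this sketch.

```python
# Minimal sketch of the rethink-and-rearrange loop described above.
# All names here are hypothetical placeholders, not the thesis's actual
# interfaces; real implementations would wrap an MLLM and a multimodal
# vector store.

from dataclasses import dataclass

@dataclass
class RAGContext:
    passages: list[str]    # text retrieved from the medical knowledge base
    image_refs: list[str]  # hits from online / prebuilt offline image embeddings

def retrieve_multimodal_context(question: str, image: bytes) -> RAGContext:
    """Embed the text+image query and fetch grounding evidence (stub)."""
    raise NotImplementedError

def generate_with_cot(question: str, image: bytes, ctx: RAGContext) -> tuple[str, str]:
    """Prompt the MLLM for an explicit chain of thought and a final answer (stub)."""
    raise NotImplementedError

def verify_answer(answer: str, ctx: RAGContext) -> float:
    """Score how well the answer is supported by the evidence, in [0, 1] (stub)."""
    raise NotImplementedError

def rearrange_context(ctx: RAGContext, chain_of_thought: str) -> RAGContext:
    """Re-rank and reorder the evidence using the previous reasoning trace (stub)."""
    raise NotImplementedError

def rtra_mrag(question: str, image: bytes,
              max_cycles: int = 3, threshold: float = 0.9) -> str:
    """Iterate retrieve -> reason (CoT) -> verify -> rearrange until grounded."""
    ctx = retrieve_multimodal_context(question, image)
    answer = ""
    for _ in range(max_cycles):
        chain_of_thought, answer = generate_with_cot(question, image, ctx)
        if verify_answer(answer, ctx) >= threshold:
            break  # answer sufficiently consistent with the retrieved evidence
        ctx = rearrange_context(ctx, chain_of_thought)  # the "rethink" step
    return answer
```

The design choice mirrored in this sketch is that verification gates the loop: the model stops rethinking only once its answer is judged consistent with the retrieved multimodal evidence, otherwise the evidence is rearranged and another reasoning cycle runs.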
Citation
Jinhong Wang, “Rethink and Rearrange: Iterative Multimodal Reasoning for Enhanced Medical QA via Multimodal Retrieval-Augmented Generation,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Large Language Model (LLM), Retrieval-Augmented Generation, Knowledge Distillation, Supervised Fine-Tuning, Chain-of-Thought
