On Enabling Foundation Models for Consistent Vision-Language Generation and Advanced Medical Segmentation Adaptation
Author
Hashmi, Sarim
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Large foundation models have revolutionized both multimodal vision-language tasks and medical image segmentation, demonstrating remarkable versatility across a broad spectrum of applications. Despite this success, these models face critical challenges that hinder their deployment in safety-critical domains. On one hand, vision-language models (VLMs) are prone to hallucination and are highly sensitive to minor prompt variations such as spelling errors and paraphrasing differences. To mitigate these issues, we propose a novel loss formulation that enforces semantic consistency and context grounding. By integrating tailored augmentations that mimic real-world linguistic variability, our fine-tuning strategy not only reduces the generation of inaccurate content but also stabilizes model outputs. Extensive experiments on the IU-Xray dataset show that this approach improves both accuracy and semantic fidelity, keeping the models reliable under diverse linguistic inputs.

On the other hand, the intricate demands of medical image segmentation require models to capture detailed, domain-specific features efficiently. Conventional fine-tuning methods either underfit when using low-rank adaptations or lack flexibility when employing full-rank updates. To address this, we introduce SALT (Singular Value Adaptation with Low-Rank Transformation), a hybrid parameter-efficient fine-tuning method. SALT selectively adapts the most influential singular values with trainable scale and shift parameters while simultaneously applying low-rank updates to the remaining subspace. This balanced approach combines the strengths of LoRA and full-rank SVD-based methods, yielding robust performance across datasets ranging from as few as 20 samples to 1,000. Evaluations on five challenging medical datasets show that SALT consistently improves Dice scores by 2% to 5% while using only a fraction (3.9%) of the trainable parameters required by traditional methods.

Collectively, these advances enhance the reliability, efficiency, and safety of foundation models. By addressing the dual challenges of hallucination in multimodal VLMs and domain-specific adaptation in medical segmentation, our work lays a strong foundation for the broader adoption of robust, context-aware learning in both general and clinical environments.
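To make the consistency objective concrete, below is a minimal PyTorch sketch of one way such a loss could be assembled: the standard generation loss on the clean prompt plus a symmetric KL penalty between the model's output distributions for the clean prompt and a perturbed copy of it. The function name, the lam weighting term, and the HuggingFace-style model interface are illustrative assumptions, not the thesis's exact formulation.

    # Illustrative sketch (assumption): consistency-regularized VLM fine-tuning loss.
    # `model` is assumed to be a HuggingFace-style VLM that returns logits and,
    # when given labels, a cross-entropy generation loss.
    import torch
    import torch.nn.functional as F

    def consistency_loss(model, pixel_values, input_ids, perturbed_ids, labels, lam=0.5):
        # Standard generation loss on the clean prompt.
        out_clean = model(pixel_values=pixel_values, input_ids=input_ids, labels=labels)
        # Forward pass on an augmented prompt (e.g. a typo- or paraphrase-perturbed copy).
        out_pert = model(pixel_values=pixel_values, input_ids=perturbed_ids, labels=labels)
        # Symmetric KL between the two next-token distributions encourages the model
        # to generate the same content despite minor linguistic variation.
        p = F.log_softmax(out_clean.logits, dim=-1)
        q = F.log_softmax(out_pert.logits, dim=-1)
        kl = 0.5 * (F.kl_div(p, q, log_target=True, reduction="batchmean")
                  + F.kl_div(q, p, log_target=True, reduction="batchmean"))
        return out_clean.loss + lam * kl

In practice the perturbed_ids would be produced by the tailored augmentations the abstract mentions (spelling noise, paraphrases), so the regularizer directly penalizes output drift under those variations.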
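SALT's hybrid update can likewise be sketched for a single linear layer: the pretrained weight is decomposed once by SVD, the top singular values receive trainable scale and shift parameters, and a LoRA-style low-rank term covers the residual subspace. The class name, the ranks, and the initialization below are assumptions for illustration and may differ from the thesis's implementation; the key point is that only scale, shift, A, and B are trained, which is how the parameter count stays small.

    # Illustrative sketch (assumption): a SALT-style adapted linear layer.
    import torch
    import torch.nn as nn

    class SALTLinear(nn.Module):
        def __init__(self, weight: torch.Tensor, rank_svd: int = 16, rank_lora: int = 4):
            super().__init__()
            # One-time SVD of the frozen pretrained weight (out_dim x in_dim).
            U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
            self.register_buffer("U", U)
            self.register_buffer("S", S)
            self.register_buffer("Vh", Vh)
            self.rank_svd = rank_svd
            # Trainable scale and shift for the top-r singular values.
            self.scale = nn.Parameter(torch.ones(rank_svd))
            self.shift = nn.Parameter(torch.zeros(rank_svd))
            # LoRA-style low-rank update covering the remaining subspace;
            # B starts at zero so the layer initially matches the pretrained weight.
            out_dim, in_dim = weight.shape
            self.A = nn.Parameter(torch.randn(rank_lora, in_dim) * 0.01)
            self.B = nn.Parameter(torch.zeros(out_dim, rank_lora))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Scale-and-shift adaptation of the dominant singular values.
            S_adapt = self.S.clone()
            S_adapt[: self.rank_svd] = self.scale * self.S[: self.rank_svd] + self.shift
            # Reconstruct the adapted weight and add the low-rank residual update.
            W = self.U @ torch.diag(S_adapt) @ self.Vh
            return x @ (W + self.B @ self.A).T

Wrapping, say, the attention projections of a frozen segmentation backbone with such layers would leave only the scale/shift vectors and the two low-rank factors trainable, consistent with the small trainable-parameter fraction reported in the abstract.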
Citation
Sarim Hashmi, “On Enabling Foundation Models for Consistent Vision-Language Generation and Advanced Medical Segmentation Adaptation,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Vision Language Models, Segmentation, Medical Imaging
