Towards Enhanced Vision-Language Grounding in Video Understanding Large Multimodal Models
Author
Munasinghe, Shehan
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Fine-grained alignment between video content and textual descriptions remains a significant challenge due to the complex spatial and temporal dynamics inherent in video data. While recent video-based Large Multimodal Models (LMMs) demonstrate promising results in general video understanding and dialogue, they often fall short when it comes to precise pixel-level grounding. In this thesis, we propose VideoGLaMM, a novel large-scale multimodal framework specifically designed for fine-grained visual grounding in videos based on user-provided text queries. Our architecture integrates a powerful language model with a dual vision encoder, capturing both spatial and temporal features, and a pixel-level decoder capable of generating accurate segmentation masks. To facilitate this, we introduce lightweight vision-to-language and language-to-vision adapters, enabling tight vision-language alignment. We also construct a comprehensive dataset of over 38k grounded video question-answer pairs, featuring more than 83k objects and 670k pixel-level masks. Through extensive evaluations across tasks such as grounded video conversations, visual grounding, and referring video object segmentation, VideoGLaMM consistently outperforms existing approaches, establishing a new benchmark for fine-grained video-language understanding.
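The abstract describes a pipeline of a dual vision encoder, lightweight vision-to-language and language-to-vision adapters around a language model, and a pixel-level decoder. The sketch below illustrates how such components could be wired together; it is a minimal PyTorch sketch under stated assumptions, not the thesis implementation. All module choices, dimensions, and the `DualEncoderGroundingLMM` name are hypothetical stand-ins for the actual backbones used in VideoGLaMM.

```python
# Hypothetical sketch of a VideoGLaMM-style forward pass, assuming simple
# stand-in modules for each component named in the abstract. Real backbones
# (image/video encoders, the LLM, the mask decoder) would replace these.
import torch
import torch.nn as nn


class DualEncoderGroundingLMM(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=1024, mask_dim=256):
        super().__init__()
        # Spatial encoder: per-frame patch features (stand-in for an image backbone).
        self.spatial_encoder = nn.Linear(3 * 16 * 16, vis_dim)
        # Temporal encoder: features across frames (stand-in for a video backbone).
        self.temporal_encoder = nn.GRU(vis_dim, vis_dim, batch_first=True)
        # Lightweight adapters for tight vision-language alignment.
        self.v2l_adapter = nn.Linear(vis_dim, llm_dim)   # vision -> language
        self.l2v_adapter = nn.Linear(llm_dim, mask_dim)  # language -> vision
        # Stand-in for the language model: maps fused token embeddings to states.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Pixel-level decoder: turns a grounded query embedding into a mask.
        self.mask_decoder = nn.Sequential(nn.Linear(mask_dim, 32 * 32), nn.Sigmoid())

    def forward(self, video_patches, text_embeds):
        # video_patches: (B, T, N, 3*16*16); text_embeds: (B, L, llm_dim)
        B, T, N, _ = video_patches.shape
        spatial = self.spatial_encoder(video_patches)          # (B, T, N, vis_dim)
        temporal, _ = self.temporal_encoder(spatial.mean(2))   # (B, T, vis_dim)
        # Concatenate spatial and temporal tokens from the dual encoder.
        vis_tokens = torch.cat([spatial.flatten(1, 2), temporal], dim=1)
        # Project vision tokens into the language space and fuse with text.
        fused = torch.cat([self.v2l_adapter(vis_tokens), text_embeds], dim=1)
        states = self.llm(fused)                               # (B, V+L, llm_dim)
        # Ground the final text state back into pixel space as per-frame masks.
        query = self.l2v_adapter(states[:, -1])                # (B, mask_dim)
        masks = self.mask_decoder(query).view(B, 1, 32, 32)
        return masks.expand(B, T, 32, 32)
```

In this sketch the language-to-vision adapter carries the segmentation query from the LLM's hidden state to the mask decoder, mirroring the abstract's claim that grounding flows in both directions between the vision and language streams.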
Citation
Shehan Munasinghe, “Towards Enhanced Vision-Language Grounding in Video Understanding Large Multimodal Models,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Video Understanding, Large Multimodal Models, Vision-language Grounding, Large Language Model (LLM)
