Attention-Guided Vision-Language Modeling for Image and Video Captioning
Gawfan, Nasser Aref Mohsen
Author
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Image captioning is a fundamental task at the intersection of computer vision and natural language processing that aims to generate semantically coherent and contextually rich textual descriptions of visual content. Although vision-language models (VLMs) have made substantial progress, current approaches remain limited in fine-grained object recognition, scene-context comprehension, and relational reasoning among objects. These shortcomings lead to ambiguous captions and poor generalization to unseen objects. In this thesis, we present an attention-based image captioning framework consisting of three attention modules, Scene Attention, Object Attention, and Graph Interaction Attention, which enhance captioning at both the global and local levels. Our model uses visual embeddings from BLIP as a prefix and fine-tunes the GPT-2 language model. In particular, Scene Attention captures global contextual information, Object Attention distinguishes object-specific features, and Graph Interaction Attention strengthens relational reasoning among objects, leading to more accurate and informative captions. To adapt the architecture to video captioning, we add a temporal attention module that accounts for dependencies across frames. Our model generalizes well in both in-domain and cross-domain video settings, even when trained solely on visual features and without reliance on extensive multimodal pretraining. Extensive evaluations on standard video captioning benchmarks validate the effectiveness and robustness of our approach.
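The abstract describes three attention streams (scene-level, object-level, and graph interaction) whose outputs condition a language model via prefix embeddings. The following is a minimal NumPy sketch of that idea, not the thesis's actual implementation: the embedding dimension, the graph-interaction approximation (a second attention pass over object contexts), and the additive fusion of the three streams are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # standard scaled dot-product attention
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
d = 16       # embedding dimension (illustrative, not the thesis's value)
n_obj = 5    # number of detected object regions (illustrative)

# stand-ins for BLIP visual features: one global scene vector, n_obj region vectors
scene = rng.normal(size=(1, d))
objects = rng.normal(size=(n_obj, d))

# Scene Attention: object queries attend to the global scene embedding
scene_ctx = attention(objects, scene, scene)

# Object Attention: self-attention over object-specific features
obj_ctx = attention(objects, objects, objects)

# Graph Interaction Attention: a second pass over the object contexts,
# approximating pairwise relational reasoning among objects
graph_ctx = attention(objects, obj_ctx, obj_ctx)

# fuse the three streams into prefix embeddings for the language model
# (additive fusion is an assumption made for this sketch)
prefix = scene_ctx + obj_ctx + graph_ctx
print(prefix.shape)  # (5, 16)
```

In a full system, `prefix` would be projected into the language model's embedding space and prepended to the caption token embeddings before fine-tuning; for video, an analogous temporal attention pass would run over per-frame features before fusion.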
Citation
Nasser Aref Mohsen Gawfan, “Attention-Guided Vision-Language Modeling for Image and Video Captioning,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Artificial Intelligence, Vision-Language Model, Vision Transformer
