CONCAP: Seeing Beyond English with Retrieval-Augmented Captioning

Author
Ibrahim, George Sherif Botros
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
Multilingual vision-language models have made significant strides in image captioning, yet they still lag behind their English counterparts due to limited multilingual training data and the cost of scaling model parameters. Retrieval-augmented generation (RAG) offers a promising alternative by conditioning caption generation on retrieved examples in the target language, reducing the need for extensive multilingual training. However, multilingual RAG captioning models often depend on retrieved captions translated from English, which can introduce mismatches and linguistic biases relative to the source language. We introduce CONCAP, a multilingual image captioning model that integrates retrieved captions with image-specific concepts, enhancing the contextualization of the input image and grounding the captioning process across different languages. Experiments on the XM3600 dataset indicate that CONCAP achieves strong performance with significantly reduced data requirements. Additionally, evaluations on a culturally focused Arabic dataset show that concept-aware captioning improves cultural relevance, though models trained directly on cultural data still maintain an edge. Our findings highlight the effectiveness of concept-aware retrieval augmentation in bridging multilingual performance gaps and emphasize the need for culturally informed modeling in broader multilingual applications.
Citation
George Sherif Botros Ibrahim, “CONCAP: Seeing Beyond English with Retrieval-Augmented Captioning,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Vision-Language Models, Culture, Retrieval Augmented Generation, RAG, Concepts, Captions