On Culturally-diverse Multilingual Video Large Multimodal Models

Shafique, Bhuiyan Sanjid
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Large multimodal models (LMMs) have recently gained attention due to their effectiveness in understanding and generating descriptions of visual content. Most existing LMMs are limited to the English language. While a few recent works explore multilingual image LMMs, to the best of our knowledge, moving beyond the English language has yet to be investigated in the context of video LMMs. In pursuit of more inclusive video LMMs, we propose a machine-translated multilingual video training set comprising 1.2 million samples in 14 languages, and develop a simple yet effective multilingual video LMM. Our training set covers both low- and high-resource languages: English, Chinese, Spanish, French, German, Hindi, Arabic, Russian, Bengali, Urdu, Sinhala, Tamil, Swedish, and Japanese. Our proposed model, ViMUL-LMM, is shown to provide a better trade-off between high- and low-resource languages for video understanding. Additionally, we introduce a multilingual video LMM benchmark, named ViMUL-Bench, to evaluate video LMMs across the same 14 languages. ViMUL-Bench is designed to rigorously test video LMMs across 15 categories, including eight culturally diverse categories, and comprises 8k manually verified samples with both open-ended and multiple-choice questions spanning short, medium, and long video durations. We hope that ViMUL-Bench and ViMUL-LMM, along with our large-scale multilingual video training set, will facilitate future research in developing culturally and linguistically inclusive multilingual video LMMs.
Citation
Bhuiyan Sanjid Shafique, “On Culturally-diverse Multilingual Video Large Multimodal Models,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Multilingual, Cross-Cultural, Large Language Model (LLM), Multimodal