Palo: A Polyglot Large Multimodal Model for 5B People
Rasheed, Hanoona ; Maaz, Muhammad ; Shaker, Abdelrahman ; Khan, Salman ; Cholakal, Hisham ; Anwer, Rao M. ; Baldwin, Tim ; Felsberg, Michael ; Khan, Fahad S.
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages (English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese) that together span ~5B people (65% of the world population). Our approach uses semi-automated translation to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while remaining scalable because it requires minimal manual effort. Incorporating diverse instruction sets boosts overall performance across multiple languages, especially underrepresented ones such as Hindi, Arabic, Bengali, and Urdu. The resulting models are trained at three scales (1.7B, 7B, and 13B parameters) to demonstrate generalization and scalability, and we observe substantial improvements over strong baselines. We also propose the first multilingual multimodal benchmark for future approaches to evaluate their vision-language reasoning capabilities across languages. Code: https://github.com/mbzuai-oryx/PALO.
Citation
H. Rasheed et al., “Palo: A Polyglot Large Multimodal Model for 5B People,” 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 1745–1754, Feb. 2025, doi: 10.1109/WACV61041.2025.00177.
Source
2025 IEEE/CVF Winter Conference on Applications of Computer Vision
Keywords
Visualization, Translation, Codes, Scalability, Large language models, Instruction sets, Oral communication, Manuals, Cognition, Multilingual
Publisher
IEEE
