JEEM: Vision-Language Understanding in Four Arabic Dialects
Kadaoui, Karima ; Atwany, Hanin ; Al-Ali, Hamdan ; Mohamed, Abdelrahman ; Mekky, Ali ; Tilga, Sergei ; Fedorova, Natalia ; Artemova, Ekaterina ; Al Darmaki, Hanan ; Kementchedjhieva, Yova
Files
2026.findings-eacl.18.pdf
Adobe PDF, 4.99 MB
Department
Natural Language Processing
Type
Conference proceeding
Abstract
We introduce JEEM, a benchmark designed to evaluate Vision-Language Models (VLMs) on visual understanding across four Arabic-speaking countries: Jordan, the Emirates, Egypt, and Morocco. JEEM includes the tasks of image captioning and visual question answering, and features culturally rich and regionally diverse content. The dataset aims to assess the ability of VLMs to generalize across dialects and to accurately interpret cultural elements in visual contexts. In an evaluation of five prominent open-source Arabic VLMs and GPT-4o, we find that the Arabic VLMs consistently underperform, struggling with both visual understanding and dialect-specific generation. While GPT-4o ranks best in this comparison, the model’s linguistic competence varies across dialects, and its visual understanding capabilities lag behind. This underscores the need for more inclusive models and the value of culturally diverse evaluation paradigms.
Citation
K. Kadaoui, H. Atwany, H. Al-Ali, A. Mohamed, A. Mekky, S. Tilga, et al., "JEEM: Vision-Language Understanding in Four Arabic Dialects," 2026, pp. 331-354.
Conference
Findings of the Association for Computational Linguistics: EACL 2026
Publisher
Association for Computational Linguistics
