BALSAM: A Platform for Benchmarking Arabic Large Language Models
Author
Almatham, Rawan Nasser
Darwish, Kareem Mohamed
Al-Rasheed, Raghad
Alshammari, Waad Thuwaini
Alhoshan, Muneera
Almazrua, Amal
Al Wazrah, Asma
Alheraki, Mais
Alam, Firoj
Nakov, Preslav
Alzahrani, Norah A.
Albilali, Eman
Habash, Nizar
El-Sheikh, Abdelrahman Mustafa
Elmallah, Muhammad
Mubarak, Hamdy
Alyafeai, Zaid
Anwar, Mohamed
Li, Haonan
Abdelali, Ahmed
Altwairesh, Nora
Hasanain, Maram
Al-Thubaity, Abdulmohsen
Shehata, Shady
Alhafni, Bashar
Hamed, Injy
Inoue, Go
Elmadani, Khalid N.
Obeid, Ossama
Haouari, Fatima
Elsayed, Tamer
Alghamdi, Emad A.
Almubarak, Khalid
Alshahrani, Saied
Aljareh, Ola
Alajlan, Safa
Alshaqarawi, Areej
Alshihri, Maryam
Alghurabi, Sultana
Alzeghayer, Atikah
Altamimi, Afrah
Alfaifi, Abdullah
Alosaimy, Abdulrahman M.
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
The impressive advancement of Large Language Models (LLMs) in English has not been matched across all languages. In particular, LLM performance in Arabic lags behind, owing to data scarcity, the linguistic diversity of Arabic and its dialects, and morphological complexity, among other factors. Progress is further hindered by the quality of Arabic benchmarks, which typically rely on static, publicly available data, lack comprehensive task coverage, or do not provide dedicated platforms with blind test sets. This makes it challenging to measure actual progress and to mitigate data contamination. Here, we aim to bridge these gaps. In particular, we introduce BALSAM, a comprehensive, community-driven benchmark aimed at advancing Arabic LLM development and evaluation. It includes 78 NLP tasks from 14 broad categories, with 52K examples divided into 37K test and 15K development examples, and a centralized, transparent platform for blind evaluation. We envision BALSAM as a unifying platform that sets standards and promotes collaborative research to advance Arabic LLM capabilities.
Citation
R. N. Almatham et al., “BALSAM: A Platform for Benchmarking Arabic Large Language Models,” Proceedings of The Third Arabic Natural Language Processing Conference, pp. 258–277, 2025, doi: 10.18653/V1/2025.ARABICNLP-MAIN.21
Source
Proceedings of The Third Arabic Natural Language Processing Conference
Conference
Third Arabic Natural Language Processing Conference
Keywords
Arabic Large Language Models, Benchmarking Platform, Multitask Evaluation, Arabic NLP Tasks, Blind Test Sets, Leaderboard Design, Data Diversity, Community-Driven Benchmarking
Publisher
Association for Computational Linguistics
