Towards Scalable Multilingual AI: Benchmarking Compressed Transformer Models Across Languages
Alshehhi, Maitha Khaled Ahmed Saeed
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Large Language Models (LLMs) have demonstrated remarkable proficiency in natural language processing (NLP) tasks, yet their effectiveness varies significantly across languages. High-resource languages such as English benefit from extensive training data and well-established benchmarks, while low-resource languages such as Arabic and the Indic languages face performance disparities, higher hallucination rates, and limited generalization. This thesis evaluates and compares the performance of multilingual and monolingual LLMs across Arabic, English, and Indic datasets using the Arabic MMLU, English MMLU, and Indic benchmarks. To uncover these disparities, it conducts a detailed assessment of several decoder-only Transformer models, including BLOOMZ, AceGPT, Jais, LLaMA-2, XGLM, and AraGPT2.

The research follows a two-pronged approach: first, evaluating the models in their full-precision form, and second, applying compression techniques such as weight pruning and quantization to assess their impact on accuracy and efficiency. A structured evaluation framework integrates refined prompt engineering, strategic inference techniques, and probability-based answer selection to ensure thorough and fair comparisons. The thesis also examines model outputs to detect and quantify hallucinations, offering insights into misinformation patterns across linguistic settings. Experiments run in a high-performance computing environment, enabling efficient execution of large-scale inference tasks, and the models are tested across multiple domains, including STEM, humanities, and general knowledge. Particular attention is given to performance under different compression levels, aiming to identify the best balance between computational efficiency and accuracy.

Findings reveal that BLOOMZ-7.1B demonstrates strong multilingual generalization, while AceGPT-13B and XGLM achieve competitive results with language-specific strengths. Monolingual models such as AraGPT2 perform well on Arabic-language tasks but struggle with cross-lingual generalization. Quantization (4-bit and 8-bit) maintains model accuracy while reducing memory usage, making it a practical optimization strategy for deployment; aggressive pruning (exceeding 40%), however, significantly degrades both accuracy and confidence scores. The thesis also finds a higher incidence of hallucinations in low-resource languages, with Arabic models frequently generating incorrect responses, and shows that biases in multilingual benchmarks affect evaluation outcomes, as existing datasets often lack diverse linguistic and cultural representation.

These insights underscore the need for more inclusive benchmarking methodologies to ensure fair and accurate assessment of LLMs. By addressing language-specific challenges and compression trade-offs, this research contributes to AI fairness and multilingual NLP, seeking to bridge the gap between high-resource and low-resource languages and foster more inclusive progress in AI.
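The probability-based answer selection used in the evaluation framework can be made concrete with a small sketch: for each multiple-choice option, a causal LM scores the log-likelihood of the option tokens conditioned on the question, and the highest-scoring option is selected. The model name, prompt template, and scoring details below are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch of probability-based answer selection for MMLU-style
# multiple choice, assuming a Hugging Face causal LM. The model and prompt
# are stand-ins, not the thesis's actual setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloomz-560m"  # small stand-in for the evaluated models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def score_option(question: str, option: str) -> float:
    """Sum the log-probabilities the model assigns to the option tokens."""
    prompt = f"Question: {question}\nAnswer:"
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + " " + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # The logit at position i predicts the token at position i + 1, so option
    # tokens (positions prompt_len .. end) are scored by rows prompt_len - 1 on.
    # Note: assumes the prompt tokenizes identically with and without the option.
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

question = "What is the capital of the United Arab Emirates?"
options = ["Abu Dhabi", "Dubai", "Sharjah", "Ajman"]
scores = [score_option(question, o) for o in options]
print(options[max(range(len(options)), key=scores.__getitem__)])
```

Summing raw log-probabilities favors shorter options; length-normalizing the score by the option's token count is a common variant.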
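The 4-bit and 8-bit quantization results refer to post-training quantization of the kind supported by Hugging Face transformers with bitsandbytes; a minimal loading sketch follows, with the specific settings (NF4 weights, fp16 compute) as assumptions rather than the thesis's reported configuration.

```python
# Minimal sketch of loading a model in 4-bit precision with bitsandbytes.
# The quantization settings are assumptions, not the thesis's configuration.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # for 8-bit, use load_in_8bit=True instead
    bnb_4bit_quant_type="nf4",             # 4-bit NormalFloat weight format
    bnb_4bit_compute_dtype=torch.float16,  # matrix multiplies run in fp16
)

model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloomz-7b1",               # one of the evaluated models
    quantization_config=bnb_config,
    device_map="auto",                     # requires the accelerate package
)
```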
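On the pruning side, the abstract does not state which pruning criterion is used, so the sketch below assumes unstructured L1 (magnitude) pruning of Linear layers via PyTorch's built-in utilities; the 40% default echoes the sparsity beyond which accuracy is reported to degrade.

```python
# Minimal sketch of unstructured magnitude pruning with torch.nn.utils.prune.
# The L1 criterion and Linear-only scope are assumptions for illustration.
import torch.nn as nn
import torch.nn.utils.prune as prune

def prune_linear_layers(model: nn.Module, amount: float = 0.4) -> nn.Module:
    """Zero out the smallest-magnitude weights in every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the zeros into the weight tensor
    return model
```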
Citation
Maitha Khaled Ahmed Saeed Alshehhi, “Towards Scalable Multilingual AI: Benchmarking Compressed Transformer Models Across Languages,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Large Language Models (LLM), Model Compression, Evaluation and Analysis in Machine Learning, Fairness and Bias, Multilingualism and Linguistic Diversity
