
The Arabic Generality Score: Another Dimension in Modelling Arabic Dialectness

Shaban, Sanad
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
Arabic is a diglossic language characterized by the coexistence of Modern Standard Arabic (MSA) and a wide range of spoken dialects. In natural language processing (NLP), these varieties have traditionally been treated as discrete categories, but in reality they form a multi-layered continuum, where dialects overlap with one another and with MSA to varying degrees. The field has recently begun to attend to this continuum, with efforts aimed at developing more nuanced representations of dialectness, including multi-label classification and continuous scoring methods. The Arabic Level of Dialectness (ALDi), which assigns a continuous score in the range [0, 1], is one such method. It offers a framework for estimating where a sentence lies along the MSA–dialectal continuum. While ALDi has proven useful, particularly in explaining inter-annotator disagreement in dialectal corpora, it remains a one-dimensional model that compresses multiple linguistic factors into a single score. This thesis investigates a complementary dimension of dialectness: generality vs. specificity. We introduce the Arabic Generality Score (AGS), a new score that captures how broadly a linguistic feature (e.g., a word) is used across Arabic varieties. AGS is defined in the range [0, 1], where higher values indicate broader usage. While AGS may correlate with ALDi, we show that it is not fully captured by it, which motivates modeling it explicitly. To that end, we propose a novel pipeline for annotating a multi-dialectal parallel corpus with generality scores. The pipeline consists of word alignment, an etymologically and phonologically informed edit distance function, and a final aggregation step to compute AGS. We model word-level AGS estimation as a regression task: given a sentence, estimate the AGS for the marked word. For evaluation, we aggregate word-level AGS predictions into sentence-level scores and compare performance against standard multi-dialect identification (MDID) models.
Our results show that the proposed pipeline captures generality more effectively than these MDID baselines, with lower RMSE scores across test sets. Notably, training on the full MADAR-26 dataset yields no clear advantage over MADAR-6, suggesting that the six major dialects provide sufficient coverage for modeling generality. This work contributes a new dimension to computational models of Arabic dialectness, a methodology for linguistically informed generality estimation, a set of annotated resources, and empirical evidence supporting the distinctiveness and utility of the generality signal. Together, these contributions offer a more nuanced approach to representing dialectal variation in Arabic NLP.
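Illustrative sketch (not taken from the thesis): the abstract names three pipeline stages (word alignment, an edit distance function, and an aggregation step) plus sentence-level aggregation and RMSE evaluation. The Python below shows one way the last stages could fit together, under loud assumptions. Word alignment is assumed to have been done already. Plain Levenshtein distance stands in for the etymologically and phonologically informed distance, which the abstract does not specify. Averaging similarities is an assumed aggregation rule. All function names are hypothetical.

```python
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Plain unit-cost Levenshtein distance (a stand-in, not the thesis's function)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalised similarity in [0, 1]; 1.0 means identical forms."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def word_generality(word: str, aligned_forms: list[str]) -> float:
    """Assumed aggregation: mean similarity to the aligned forms in other varieties."""
    if not aligned_forms:
        return 0.0  # assumption: no aligned counterpart counts as fully specific
    return sum(similarity(word, f) for f in aligned_forms) / len(aligned_forms)


def sentence_generality(word_scores: list[float]) -> float:
    """Assumed sentence-level score: mean of the word-level scores."""
    return sum(word_scores) / len(word_scores)


def rmse(predicted: list[float], gold: list[float]) -> float:
    """Root mean squared error between predicted and reference scores."""
    return sqrt(sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold))


if __name__ == "__main__":
    # Placeholder strings, not real dialect data.
    score = word_generality("abc", ["abc", "abd"])
    print(round(score, 4))
    print(rmse([sentence_generality([score, 1.0])], [1.0]))
```

Every score stays in [0, 1] with higher values meaning broader usage, matching the range given in the abstract. The weighting and distance function used in the thesis may differ substantially from this simplification.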
Citation
Sanad Shaban, “The Arabic Generality Score: Another Dimension in Modelling Arabic Dialectness,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Arabic Dialectness, Arabic Generality Score (AGS), Arabic Dialect Continuum, Linguistic Generality