Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models
Mekky, Ali ; El Zeftawy, Mohamed ; Hassan, Lara ; Keleg, Amr ; Nakov, Preslav
Files
2026.vardial-1.22.pdf
Adobe PDF, 1017.11 KB
Department
Natural Language Processing
Type
Conference proceeding
Abstract
Although Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the fact that only single-label datasets are available, with no large-scale multi-label resources for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address this issue, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). We then train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LahjatBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system.
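As a rough illustration of two ideas named in the abstract, the sketch below orders training examples by label cardinality and computes a macro-averaged F1 over multi-label predictions. It is not the authors' implementation. The record layout (`labels`, `aldi`), the function names, and the tie-breaking rule on the ALDi score are all assumptions made for this example; the abstract does not specify how the curriculum is ordered.

```python
from typing import Dict, List, Sequence, Set


def curriculum_order(examples: Sequence[Dict]) -> List[Dict]:
    """Sort examples from 'easy' to 'hard' for a curriculum.

    Assumption (not from the paper): an example counts as easier when it
    has fewer dialect labels (lower label cardinality). Ties are broken
    by a hypothetical `aldi` score in [0, 1], highest first.
    """
    return sorted(examples, key=lambda ex: (len(ex["labels"]), -ex["aldi"]))


def macro_f1(
    gold: Sequence[Set[str]],
    pred: Sequence[Set[str]],
    labels: Sequence[str],
) -> float:
    """Macro-averaged F1: one binary F1 per label, then an unweighted mean."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # A label with no gold and no predicted instances scores 0.0 here;
        # other conventions exist and would change the average.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because the macro average weights every dialect equally, a dialect that is rarely predicted correctly lowers the score as much as a frequent one, which is one reason the choice of negative samples matters in this setting.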
Citation
A. Mekky, M. El Zeftawy, L. Hassan, A. Keleg, P. Nakov, "Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models," 2026, pp. 261-274.
Source
Conference
Proceedings of the 13th Workshop on NLP for Similar Languages, Varieties and Dialects
Keywords
39 Education, 3901 Curriculum and Pedagogy, 46 Information and Computing Sciences, 4611 Machine Learning, 47 Language, Communication and Culture
Publisher
Association for Computational Linguistics
