Beyond Cairo: Sa’idi Egyptian Arabic Corpus Construction and Analysis
Eida, Mai Mohamed ; Habash, Nizar
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Egyptian Arabic (EA) NLP resources have mainly focused on Cairene Egyptian Arabic (CEA), leaving sub-dialects such as Sa’idi Egyptian Arabic (SEA) underrepresented. This paper introduces the first SEA corpus: an open-source, 4-million-word literary dataset of a dialect spoken by approximately 30 million Egyptians. To validate that the corpus represents SEA, we analyze SEA-specific linguistic features drawn from dialectal surveys and confirm that they are more prevalent in our corpus than in existing EA datasets. Our findings offer insights into SEA’s orthographic representation in morphology, phonology, and lexicon, incorporating CODA* guidelines for normalization.
Citation
M. M. Eida and N. Habash, “Beyond Cairo: Sa’idi Egyptian Arabic Corpus Construction and Analysis,” 2025. Accessed: May 05, 2025. [Online]. Available: https://aclanthology.org/2025.nlp4dh-1.26/
Source
Proceedings of the 5th International Conference on Natural Language Processing for Digital Humanities
Conference
NAACL 2025
Publisher
Association for Computational Linguistics
