Morphemes without Borders: Evaluating Root–Pattern Morphology in Arabic Tokenizers and LLMs
Alakeel, Yara Yousif ; Qwaider, Chatrine ; Al Darmaki, Hanan ; Alqahtani, Sawsan
Department
Natural Language Processing
Type
Conference proceeding
Language
English
Abstract
This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root–pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. The Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root–pattern generation using a newly developed benchmark. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, which calls into question the role of morphological tokenization in downstream performance.
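As an illustration only, comparing a tokenizer's split of a word against a gold-standard morphological segmentation can be sketched as a boundary-level precision/recall/F1 computation. This record does not state which metric the authors use, so the metric, the function names `boundaries` and `boundary_prf`, and the example word are all assumptions. The sketch also covers only concatenative segment boundaries, not the non-concatenative root–pattern structure itself.

```python
def boundaries(segments):
    """Return the internal character offsets at which a segmentation splits a word."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:  # the end of the last segment is the word end, not a split
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_prf(predicted, gold):
    """Boundary-level precision, recall, and F1 of a predicted split against a gold split.

    Hypothetical metric for illustration; not taken from the paper.
    """
    if "".join(predicted) != "".join(gold):
        raise ValueError("predicted and gold segmentations must spell the same word")
    p, g = boundaries(predicted), boundaries(gold)
    tp = len(p & g)
    # Convention assumed here: an empty boundary set scores 1.0 (nothing to get wrong or miss).
    precision = tp / len(p) if p else 1.0
    recall = tp / len(g) if g else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative word: "والكتاب" ("and the book"), gold split into conjunction + article + stem.
gold = ["و", "ال", "كتاب"]
predicted = ["وال", "كتاب"]  # a made-up tokenizer output that merges the two prefixes
print(boundary_prf(predicted, gold))
```

In this made-up case the predicted split recovers one of the two gold boundaries and adds no spurious ones, so precision is 1.0 and recall is 0.5.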
Citation
Y.Y. Alakeel, C. Qwaider, H. Al Darmaki, S. Alqahtani, "Morphemes without Borders: Evaluating Root–Pattern Morphology in Arabic Tokenizers and LLMs," 2026, pp. 11787-11799.
Source
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Publisher
ELDA
