Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study

Toyin, Hawau Olamide
Magdy, Samar Mohamed
Al Darmaki, Hanan
Department
Natural Language Processing
Type
Conference proceeding
Language
English
Abstract
We investigate the effectiveness of large language models (LLMs) for text diacritization in two typologically distinct languages: Arabic and Yoruba. To enable a rigorous evaluation, we introduce MultiDiac, a novel multilingual dataset with diverse samples that capture a range of diacritic ambiguities. We evaluate 12 LLMs varying in size, accessibility, and language coverage, and benchmark them against four specialized diacritization models. Additionally, we fine-tune four small open-source models on Yoruba using LoRA. Our results show that many off-the-shelf LLMs outperform specialized diacritization models for both Arabic and Yoruba, but smaller models suffer from hallucinations. We find that fine-tuning on a small dataset can improve diacritization performance and reduce hallucination rates for Yoruba.
Citation
H.O. Toyin, S.M. Magdy, H. Al Darmaki, "Are LLMs Good Text Diacritizers? An Arabic and Yoruba Case Study," 2026, pp. 580-589.
Source
Conference
The Fifteenth Language Resources and Evaluation Conference (LREC 2026)
Publisher
ELDA