Investigating the Effect of Diacritics on Arabic Speech Models

Mohammed, Ibrahim Ali Ibrahim Ali
Department
Machine Learning
Embargo End Date
2026-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Arabic is a morphologically rich language in which short vowels, indicated by diacritics, play a crucial role in determining both pronunciation and meaning. However, these diacritics are frequently omitted in written Arabic, as native speakers can infer them from context. This omission introduces significant ambiguity for both human readers and NLP systems, particularly in tasks such as Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis, where accurate pronunciation is essential. Modern Arabic NLP models are typically trained on undiacritized text, limiting their ability to model the full phonological structure of the language. In this thesis, we address the problem of diacritic omission by explicitly integrating diacritics into the pretraining of multimodal transformer models. We investigate whether diacritic-aware pretraining can improve performance, especially for Arabic speech synthesis. Our contributions include a detailed study of ASR-based diacritization, an evaluation of data augmentation strategies, and the development of a Diacritized Arabic Text and Speech Transformer (ArTST). Comprehensive evaluations across diverse Arabic domains demonstrate that the diacritic-aware ArTST model consistently achieves the lowest ASR error rates among state-of-the-art systems. Introducing random diacritic augmentation has a negligible effect on performance, neither significantly improving nor degrading results. In TTS preference tests, including diacritics during fine-tuning delivers substantial gains in naturalness and intelligibility, while an additional diacritic-aware pretraining phase yields a modest but consistent further improvement over fine-tuning alone.
Citation
Ibrahim Ali Ibrahim Ali Mohammed, “Investigating the Effect of Diacritics on Arabic Speech Models,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Arabic, Diacritics, Automatic Speech Recognition, Text-to-Speech, Multimodal Transformer Model, short vowels