Uncertainty Estimation for Partial Diacritization in the Arabic Language
Alblooshi, Humaid Ali Ahmed Mohammed
Department
Natural Language Processing
Embargo End Date
30/05/2025
Type
Thesis
Date
2025
Language
English
Abstract
This thesis proposes an uncertainty-driven approach to partial Arabic diacritization. It addresses the limitations of fully diacritized text in downstream tasks and tackles model calibration issues through regularization schemes. We leverage three uncertainty estimation techniques, namely Softmax Response, Monte Carlo Dropout (BALD), and Mahalanobis Distance, to selectively retain diacritics. Our approach is extensively evaluated and, compared to the baseline model, demonstrates consistent performance in terms of diacritic error rate (DER) and precision in identifying problematic datapoints, as well as improved correlation between confidence scores and accuracy. Furthermore, we apply our proposed partial diacritization methodology to Arabic text-to-speech (TTS) synthesis using the VITS model, trained on a newly introduced dataset. Preliminary results indicate that full diacritization enhances synthesized speech quality and naturalness. This work will compare several diacritization schemes for speech synthesis. Overall, this thesis contributes a principled, effective method for Arabic diacritization, provides clear guidelines for uncertainty-driven selective diacritic restoration, and lays the groundwork for future research in uncertainty-informed natural language processing applications.
Citation
Humaid Ali Ahmed Mohammed Alblooshi, “Uncertainty Estimation for Partial Diacritization in the Arabic Language,” Master of Science thesis, Natural Language Processing, MBZUAI, 2025.
Keywords
Arabic Diacritization, Uncertainty Estimation, Arabic Natural Language Processing
