Item

Uncertainty-driven Partial Diacritization for Arabic Text

Alblooshi, Humaid Ali
Shelmanov, Artem
Aldarmaki, Hanan
Supervisor
Department
Natural Language Processing
Embargo End Date
Type
Conference proceeding
Date
2025
License
Language
English
Collections
Research Projects
Organizational Units
Journal Issue
Abstract
We present an uncertainty-based approach to Partial Diacritization (PD) for Arabic text. We evaluate three uncertainty metrics for this task: Softmax Response, BALD via MC-dropout, and Mahalanobis Distance. We further introduce a lightweight Confident Error Regularizer to improve model calibration. Our preliminary exploration illustrates possible ways to use uncertainty estimation for selectively retaining or discarding diacritics in Arabic text with an analysis of performance in terms of correlation with diacritic error rates. For instance, the model can be used to detect words with high diacritic error rates which tend to have higher uncertainty scores at inference time. On the Tashkeela dataset, the method maintains low Diacritic Error Rate while reducing the amount of visible diacritics on the text by up to 50% with thresholding-based retention.
Citation
Alblooshi, Humaid Ali
Shelmanov, Artem
Aldarmaki, Hanan
Source
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Conference
2025 Conference on Empirical Methods in Natural Language Processing
Keywords
Arabic Text Diacritization, Uncertainty Estimation, Softmax Response, BALD via MC-dropout, Mahalanobis Distance Metric, Confident Error Regularizer, Model Calibration, Partial Diacritization Thresholding
Subjects
Source
2025 Conference on Empirical Methods in Natural Language Processing
Publisher
Association for Computational Linguistics
Full-text link