Beyond Text: Leveraging Audio Utterances to Enhance Diacritic Restoration
Shatnawi, Sara
Author
Supervisor
Department
Natural Language Processing
Embargo End Date
01/01/2024
Type
Thesis
Date
2024
License
Language
English
Abstract
Automatic diacritization plays a vital role in improving the readability and comprehension of Arabic text. However, current diacritic restoration models struggle when applied to transcribed speech because of the domain and style shifts inherent in spoken language. Researchers developing a text-to-speech system for Arabic identified a significant issue: the synthesized speech contained numerous pronunciation errors, largely stemming from the absence of diacritics in Modern Standard Arabic writing. Modern Standard Arabic texts are typically written without diacritical markings, which are essential for disambiguating word senses and meanings. Their absence introduces ambiguity that poses challenges for various Arabic applications, such as information retrieval, machine translation, and text-to-speech. Integrating diacritics into Arabic text is therefore crucial for accuracy and effectiveness across these domains. This research investigates whether the automatic restoration of diacritics in speech data can be improved by exploiting parallel spoken utterances. In particular, we developed two frameworks: ASR+Text and Audio+Text. The ASR+Text framework uses a pretrained Automatic Speech Recognition (ASR) model to generate preliminary diacritized text, which is then refined in conjunction with the raw text. The Audio+Text framework instead incorporates audio features directly alongside the textual data, employing techniques such as clustering features from models like HuBERT and Wav2Vec and fine-tuning the XLS-R model with an ASR objective. We conducted a comparative analysis of these approaches against pre-existing text-only diacritic restoration models. The evaluation of our proposed models, which use audio features, revealed relative reductions in diacritic error rate of 45% for ASR+Text and 43% for Audio+Text, highlighting the substantial benefit of incorporating audio data.
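The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The idea is to quantize frame-level audio representations (such as HuBERT or Wav2Vec hidden states) into discrete unit IDs via k-means, so they can be consumed alongside text tokens. The function name, unit count, and the random stand-in features below are assumptions for illustration only; in the actual frameworks the features would come from a pretrained speech model, not random vectors.

```python
# Hedged sketch: discretizing frame-level audio features into cluster IDs.
# The random matrix below is a stand-in for real HuBERT/Wav2Vec hidden
# states (shape: num_frames x feature_dim); it only illustrates the
# pipeline shape, not the thesis's actual implementation.
import numpy as np
from sklearn.cluster import KMeans

def discretize_features(features: np.ndarray, n_units: int = 100,
                        seed: int = 0) -> np.ndarray:
    """Cluster frame-level audio features into discrete unit IDs."""
    kmeans = KMeans(n_clusters=n_units, n_init=10, random_state=seed)
    return kmeans.fit_predict(features)

# Stand-in features: 500 frames of 768-dim vectors (768 matches HuBERT-base).
rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 768))
units = discretize_features(frames, n_units=50)
print(units.shape)
```

The resulting integer sequence (one unit ID per audio frame) could then be embedded and fused with character embeddings in a downstream diacritic restoration model.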
Citation
S. Shatnawi, "Beyond Text: Leveraging Audio Utterances to Enhance Diacritic Restoration", M.S. Thesis, Natural Language Processing, MBZUAI, Abu Dhabi, UAE, 2024.
