African Speech Language Identification (LID)
Ghebremichael, Nahom Tesfu
Author
Supervisor
Department
Machine Learning
Embargo End Date
2026-05-30
Type
Thesis
Date
2025
License
Language
English
Collections
Abstract
Language Identification (LID) plays a pivotal role in multilingual speech systems, powering applications such as automatic speech recognition (ASR), speech translation, and voice-driven interfaces. It is particularly crucial in linguistically diverse regions like Africa, which is home to over 2,000 languages, many of which are low-resource and lack the annotated data necessary for developing robust models. These languages often exhibit significant dialectal variation and phonetic overlap, posing challenges to conventional LID systems. This thesis addresses the development of a scalable and accurate LID model tailored for 59 African languages, overcoming the limitations of prior approaches that either focus on high-resource languages or cover only a small set of low-resource languages with minimal dialectal complexity. The proposed methodology consolidates a wide range of publicly available African speech datasets into a unified corpus. Four pretrained transformer-based models (Wav2Vec2, HuBERT, AfriHuBERT, and Whisper) are fine-tuned using a Large Margin Softmax (LMS) classifier and a self-attention pooling mechanism, a design intended to enhance class separability and better capture temporal dependencies in speech. Evaluation using accuracy, macro-F1, and weighted-F1 metrics demonstrates that the Whisper Medium model outperforms all baselines, achieving a macro-F1 score of 87.69%, with most languages exceeding an F1 score of 80%. The key contributions of this work include the creation of the most comprehensive LID dataset for African languages to date, the development of a high-performing LID architecture, and novel insights into addressing dialectal overlap and data scarcity. These findings offer meaningful advancements in multilingual speech processing by improving the reliability of LID systems for underrepresented languages. 
They also support broader societal goals such as digital inclusion and linguistic preservation, while providing a strong benchmark for future research in African language identification.
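The abstract reports both macro-F1 and weighted-F1. As a minimal sketch of how the two differ, the snippet below computes them with the standard definitions in plain Python. The function name `f1_scores`, the language codes, and the toy labels are hypothetical illustrations, not data or code from the thesis. Macro-F1 averages per-language F1 equally, so a weak low-resource language pulls it down more than it pulls down the support-weighted average.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (per-class F1, macro-F1, weighted-F1) for two label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of true examples per class
    per_class = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro-F1: unweighted mean over classes
    macro = sum(per_class.values()) / len(labels)
    # Weighted-F1: mean over classes weighted by true-label support
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return per_class, macro, weighted


# Hypothetical toy data: one well-represented and one scarce language.
y_true = ["swa"] * 8 + ["tir"] * 2
y_pred = ["swa"] * 8 + ["swa", "tir"]

per_class, macro, weighted = f1_scores(y_true, y_pred)
print(f"per-class: {per_class}")
print(f"macro-F1: {macro:.4f}  weighted-F1: {weighted:.4f}")
```

In this toy case the scarce class has the lower F1, so macro-F1 comes out below weighted-F1, which is why macro-F1 is the more demanding headline number when language coverage is imbalanced.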
Citation
Nahom Tesfu Ghebremichael, “African Speech Language Identification (LID),” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Speech Language Identification, Self-supervised Learning, Large Margin Softmax
