The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages

Rajab, Jenalea
Aremu, Anuoluwapo
Chimoto, Everlyn Asiko
Dunbar, Dale
Morrissey, Graham
Thior, Fadel
Potgieter, Luandrie
Ojo, Jessica
Tonja, Atnafu Lambebo
Nekoto, Wilhelmina Ndapewa Onyothi
(and 4 more authors)
Abstract
This paper presents the Esethu Framework, a sustainable data curation framework specifically designed to empower local communities and ensure equitable benefit-sharing from their linguistic resources. This framework is supported by the Esethu license, a novel community-centric data license. As a proof of concept, we introduce the Vuk'uzenzele isiXhosa Speech Dataset (ViXSD), an open-source corpus developed under the Esethu Framework and License. The dataset, containing read speech from native isiXhosa speakers enriched with demographic and linguistic metadata, demonstrates how community-driven licensing and curation principles can bridge resource gaps in automatic speech recognition (ASR) for African languages while safeguarding the interests of data creators. We describe the framework guiding dataset development, outline the Esethu license provisions, detail the methodology for ViXSD, and report ASR experiments validating ViXSD's usability in building and refining voice-driven applications for isiXhosa.
Citation
J. Rajab et al., “The Esethu Framework: Reimagining Sustainable Dataset Governance and Curation for Low-Resource Languages,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 30763–30776, Aug. 2025, doi: 10.18653/V1/2025.ACL-LONG.1487.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Keywords
Sustainable Dataset Governance, Low-Resource Languages, Community-Driven Curation, Cultural and Linguistic Agency, Community-Centric License, ASR Dataset Creation, Indigenous Language Tech, Equitable Benefit-Sharing
Publisher
Association for Computational Linguistics