Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Laiyk, Nurkhan
Orel, Daniil
Joshi, Rituraj
Goloburda, Maiya
Wang, Yuxia
Nakov, Preslav Ivanov
Koto, Fajri
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset, covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entity of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
Citation
N. Laiyk et al., “Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh,” in Proceedings of the Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 14509–14538, Aug. 2025, doi: 10.18653/v1/2025.acl-long.706.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Keywords
Low-Resource Language Instruction-Tuning, Kazakh Governance & Cultural Data, LLM-Assisted Dataset Generation, Institutional/Cultural Knowledge Integration, High-Quality Manual Verification, Fine-Tuning Qwen/Falcon/Gemma, Multiple-Choice & Generative Tasks, Localised Context for LLMs
Publisher
Association for Computational Linguistics