Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh

Laiyk, Nurkhan
Orel, Daniil
Joshi, Rituraj
Goloburda, Maiya
Wang, Yuxia
Nakov, Preslav Ivanov
Koto, Fajri
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Instruction tuning in low-resource languages remains underexplored due to limited text data, particularly in government and cultural domains. To address this, we introduce and open-source a large-scale (10,600 samples) instruction-following (IFT) dataset covering key institutional and cultural knowledge relevant to Kazakhstan. Our dataset enhances LLMs' understanding of procedural, legal, and structural governance topics. We employ LLM-assisted data generation, comparing open-weight and closed-weight models for dataset construction, and select GPT-4o as the backbone. Each entry of our dataset undergoes full manual verification to ensure high quality. We also show that fine-tuning Qwen, Falcon, and Gemma on our dataset leads to consistent performance improvements in both multiple-choice and generative tasks, demonstrating the potential of LLM-assisted instruction tuning for low-resource languages.
Citation
N. Laiyk et al., “Instruction Tuning on Public Government and Cultural Data for Low-Resource Language: a Case Study in Kazakh,” vol. 1, pp. 14509–14538, Aug. 2025, doi: 10.18653/V1/2025.ACL-LONG.706.
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Keywords
Low-Resource Language Instruction-Tuning, Kazakh Governance & Cultural Data, LLM-Assisted Dataset Generation, Institutional/Cultural Knowledge Integration, High-Quality Manual Verification, Fine-Tuning Qwen/Falcon/Gemma, Multiple-Choice & Generative Tasks, Localised Context for LLMs
Publisher
Association for Computational Linguistics