NusaDialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages
Purwarianti, Ayu ; Adhista, Dea ; Baptiso, Agung ; Mahfuzh, Miftahul ; Sabila, Yusrina ; Adila, Aulia ; Cahyawijaya, Samuel ; Aji, Alham Fikri
Department
Natural Language Processing
Type
Workshop
Date
2025
Language
English
Abstract
Developing dialogue summarization for extremely low-resource languages is a challenging task. We introduce NusaDialogue, a dialogue summarization dataset for three underrepresented languages in the Malayo-Polynesian language family: Minangkabau, Balinese, and Buginese. NusaDialogue covers 17 topics and 185 subtopics, with annotations provided by 73 native speakers. Additionally, we conducted experiments that fine-tune a medium-sized language model designed specifically for Indonesian and that apply zero- and few-shot learning to various multilingual large language models (LLMs). The results indicate that, for extremely low-resource languages such as Minangkabau, Balinese, and Buginese, fine-tuning yields significantly higher performance than zero- and few-shot prompting, even when the prompting is applied to LLMs with considerably more parameters.
Citation
A. Purwarianti et al., “NusaDialogue: Dialogue Summarization and Generation for Underrepresented and Extremely Low-Resource Languages,” 2025. Accessed: Mar. 12, 2025. [Online]. Available: https://aclanthology.org/2025.sealp-1.8/
Source
Proceedings of the Second Workshop in South East Asian Language Processing, 2025
Keywords
NusaDialogue dataset, Dialogue summarization, Low-resource languages, Malayo-Polynesian languages, Large Language Models (LLMs)
Publisher
Association for Computational Linguistics
