NUSAAKSARA: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

Adilazuarda, Muhammad Farid
Wijanarko, Musa Izzanardi
Susanto, Lucky
Nur'aini, Khumaisa
Wijaya, Derry Tanti
Aji, Alham Fikri
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Indonesia boasts over 700 languages, with a rich diversity of writing systems. However, most NLP development has been based on romanized text, with limited support for native writing systems. We present NUSAAKSARA, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NUSAAKSARA covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Among the scripts covered in this dataset, the Lampung script is included despite being unsupported by Unicode. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID. Our results reveal that most NLP technologies struggle with Indonesia's local scripts, with many achieving near-zero performance.
Citation
M. F. Adilazuarda, M. I. Wijanarko, L. Susanto, K. Nur’aini, D. T. Wijaya, and A. F. Aji, “NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 28371–28401, Aug. 2025, doi: 10.18653/V1/2025.ACL-LONG.1377
Source
Proceedings of the Annual Meeting of the Association for Computational Linguistics
Conference
63rd Annual Meeting of the Association for Computational Linguistics, ACL 2025
Keywords
Indonesian Indigenous Scripts, Multimodal Benchmark, Multilingual NLP, OCR & Transliteration Tasks, Low-Resource Languages, Script Preservation Dataset, Image+Text Modalities, Language Identification in Diverse Scripts
Publisher
Association for Computational Linguistics