sPhinX: Sample efficient multilingual instruction fine-tuning through N-shot guided prompting
Ahuja, Sanchit; Tanmay, Kumar; Chauhan, Hardik Hansrajbhai; Patra, Barun; Aggarwal, Kriti; Del Corro, Luciano; Mitra, Arindam; Dhamecha, Tejas Indulal; Awadallah, Ahmed Hassan; Choudhury, Monojit; et al.
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Despite the remarkable success of large language models (LLMs) in English, a significant performance gap remains in non-English languages. To address this, we introduce a novel approach for strategically constructing a multilingual synthetic instruction tuning dataset, sPhinX. Unlike prior methods that directly translate fixed instruction-response pairs, sPhinX enhances diversity by selectively augmenting English instruction-response pairs with multilingual translations. Additionally, we propose LANGIT, a novel N-shot guided fine-tuning strategy, which further enhances model performance by incorporating contextually relevant examples in each training sample. Our ablation study shows that our approach enhances the multilingual capabilities of Mistral-7B and Phi-3-Small, improving performance by an average of 39.8% and 11.2%, respectively, across multilingual benchmarks in reasoning, question answering, reading comprehension, and machine translation. Moreover, sPhinX maintains strong performance on English LLM benchmarks while exhibiting minimal to no catastrophic forgetting, even when trained on 51 languages.
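The N-shot guided fine-tuning idea described in the abstract, prepending contextually relevant examples to each training sample, can be illustrated with a minimal sketch. The function name, record schema (`language`, `instruction`, `response`), prompt layout, and the choice of "same language" as the relevance criterion are all illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch: build one N-shot guided training sample by
# prepending up to n contextually relevant instruction-response
# examples (here: examples in the target's language) to the target pair.

def build_nshot_sample(target, pool, n=2):
    """Return a single training string with up to n in-context shots.

    `target` and each item in `pool` are dicts with keys
    'language', 'instruction', and 'response' (an assumed schema).
    """
    shots = [ex for ex in pool if ex["language"] == target["language"]][:n]
    parts = [
        f"Instruction: {ex['instruction']}\nResponse: {ex['response']}"
        for ex in shots
    ]
    # The target pair always comes last, so the model learns to answer
    # after seeing the in-context examples.
    parts.append(f"Instruction: {target['instruction']}\nResponse: {target['response']}")
    return "\n\n".join(parts)


pool = [
    {"language": "hi", "instruction": "2+2 kya hai?", "response": "4"},
    {"language": "fr", "instruction": "Capitale de la France ?", "response": "Paris"},
]
target = {"language": "fr", "instruction": "Traduisez 'hello'.", "response": "bonjour"}
print(build_nshot_sample(target, pool, n=1))
```

In this sketch only the French example from the pool is selected as a shot; a real implementation would likely pick shots by task type or embedding similarity rather than language alone.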
Citation
S. Ahuja et al., “sPhinX: Sample Efficient Multilingual Instruction Fine-Tuning Through N-shot Guided Prompting,” 2025. [Online]. Available: https://aclanthology.org/2025.gem-1.73/
Source
Proceedings of the Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Conference
Fourth Workshop on Generation, Evaluation and Metrics (GEM²)
Keywords
Multilingual Instruction Tuning, Synthetic Instruction Dataset, N-shot Guided Prompting, Multilingual LLMs, Performance Gap in Non-English Languages, Selective Augmentation, Diversity in Training Data, Minimal Catastrophic Forgetting
Publisher
Association for Computational Linguistics
