Nanda Family: Open-Weights Generative Large Language Models for Hindi
Singh, Aaryamonvikram ; Banerjee, Debopriyo ; Sahnan, Dhruv ; Choudhury, Monojit ; Chauhan, Shivam ; Das, Rocktim Jyoti ; Han, Xudong ; Li, Haonan ; Jadhav, Alok Anil ; Agarwal, Utkarsh ; et al.
Files
2026.eacl-long.288.pdf
Adobe PDF, 718.96 KB
Author
Singh, Aaryamonvikram
Banerjee, Debopriyo
Sahnan, Dhruv
Choudhury, Monojit
Chauhan, Shivam
Das, Rocktim Jyoti
Han, Xudong
Li, Haonan
Jadhav, Alok Anil
Agarwal, Utkarsh
Choudhary, Mukund
Koto, Fajri
Bhat, Junaid Hamid
Shukla, Awantika
Ghosh, Samujjwal
Kamboj, Samta
Pandit, Onkar
Pradhan, Lalit
Pal, Rahul
Sahu, Sunil Kumar
Mullah, Parvez
El Filali, Ali
Quraishi, Zainul Abedien Ahmed
Sengupta, Neha
Ramakrishnan, Gokulakrishnan
Joshi, Rituraj
Gosal, Gurpreet
Sheinin, Avraham
Vassilieva, Natalia
Nakov, Preslav
Department
Natural Language Processing
Type
Conference proceeding
License
http://creativecommons.org/licenses/by/4.0/
Abstract
Large language models remain predominantly English-centric, which limits their utility for underrepresented languages. We help bridge this gap for Hindi with Llama-3-Nanda-10B-Chat (aka Nanda-10B) and Llama-3.1-Nanda-87B-Chat (aka Nanda-87B), forming the Nanda family of open-weight bilingual models (https://github.com/MBZUAI-IFM/Nanda-Family). Our approach integrates: (i) a tokenizer extending Llama’s vocabulary with 20% Hindi-specific tokens, thus halving Hindi tokenization fertility while preserving English efficiency, (ii) Hindi-first parameter-efficient continual pretraining using Llama Pro on a 65B-token corpus spanning Devanagari script, code-mixed, and Romanized Hindi, and (iii) bilingual instruction and safety alignment on a large culturally grounded dataset. The resulting Nanda models outperform open-weight LLMs of comparable size: Nanda-87B yields high generative quality, and Nanda-10B shows competitive general-purpose performance. Nanda-87B demonstrates state-of-the-art performance on summarization, translation, transliteration, and instruction following. Moreover, both models achieve state-of-the-art performance in safety and in cultural knowledge. Our results demonstrate that careful tokenizer design, data curation, and continual pretraining can yield capable and safe LLMs for resource-poor languages without compromising English performance.
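As an illustration of the tokenizer claim in the abstract, the sketch below computes tokenization fertility, assuming the common definition of average tokens per whitespace-separated word; the paper may compute it differently. The function name `fertility` and the two stand-in tokenizers are hypothetical and are not taken from the Nanda code base.

```python
from typing import Callable, List


def fertility(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    if total_words == 0:
        raise ValueError("no words in input")
    return total_tokens / total_words


# Hypothetical stand-in tokenizers, for illustration only.
def char_tokenizer(text: str) -> List[str]:
    # One token per non-space character: a high-fertility extreme.
    return [ch for ch in text if not ch.isspace()]


def word_tokenizer(text: str) -> List[str]:
    # One token per word: fertility of exactly 1.0.
    return text.split()


if __name__ == "__main__":
    sample = ["नमस्ते दुनिया", "hello world"]
    print("char-level fertility:", fertility(sample, char_tokenizer))
    print("word-level fertility:", fertility(sample, word_tokenizer))
```

Under this definition, halving fertility means a tokenizer emits about half as many tokens for the same Hindi text, which shortens sequences and lowers training and inference cost. Passing a real tokenizer's `tokenize` method in place of the stand-ins would give a comparable measurement.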
Citation
A. Singh, D. Banerjee, D. Sahnan, M. Choudhury, S. Chauhan, R.J. Das, et al., "Nanda Family: Open-Weights Generative Large Language Models for Hindi," 2026, pp. 6086-6108.
Source
Conference
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Publisher
Association for Computational Linguistics
