Exploring Adapter Design Tradeoffs for Low Resource Music Generation
Mehta, Atharva ; Chauhan, Shivam ; Choudhury, Monojit
Files
3746027.3755766.pdf
Adobe PDF, 1.37 MB
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Fine-tuning large-scale music audio generation models, such as MusicGen and Mustango, is a computationally expensive process, often requiring updates to billions of parameters and, therefore, significant hardware resources. Parameter-Efficient Fine-Tuning (PEFT) techniques, particularly adapter-based methods, have emerged as a promising alternative, enabling adaptation with minimal trainable parameters while preserving model performance. However, the design space for adapters, spanning their architecture, placement, and size, is large, and it is unclear which combination yields an optimal adapter, and why, for a given low-resource music genre. In this paper, we attempt to answer this question by studying various adapter configurations for two AI music models, MusicGen and Mustango, on two genres: Hindustani Classical and Turkish Makam music.
Our findings reveal distinct trade-offs: convolution-based adapters excel at capturing fine-grained local musical details such as ornamentations and short melodic phrases, while transformer-based adapters better preserve the long-range dependencies crucial for structured improvisation. Additionally, we analyze computational resource requirements across different adapter scales, demonstrating how mid-sized adapters (40M parameters) achieve an optimal balance between expressivity and quality. Furthermore, we find that Mustango, a diffusion-based model, generates more diverse outputs that adhere more closely to the description in the input prompt, but lacks stability in notes, rhythm alignment, and aesthetics; it is also computationally intensive and requires significantly more time to train. In contrast, autoregressive models like MusicGen train faster, are more efficient, and can produce better-quality output in comparison, though with slightly higher redundancy in their generations. We release our datasets, models, and training code in the following GitHub repository: https://github.com/atharva20038/ACMMM_Adapters/tree/main.
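The abstract refers to adapter-based PEFT, in which small trainable modules are inserted into a frozen backbone so that only a tiny fraction of parameters is updated. As a hedged illustration only (not the paper's actual implementation; the layer sizes and zero-initialization scheme here are hypothetical), a minimal bottleneck adapter with a residual connection can be sketched in NumPy:

```python
import numpy as np

def bottleneck_adapter(x, W_down, W_up):
    """Down-project, apply a nonlinearity, up-project, add a residual."""
    h = np.maximum(x @ W_down, 0.0)  # ReLU after the down-projection
    return x + h @ W_up              # residual keeps the frozen path intact

rng = np.random.default_rng(0)
d_model, d_bottleneck = 512, 64     # hypothetical hidden and bottleneck sizes
W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as identity

x = rng.normal(size=(4, d_model))   # a batch of hidden states
y = bottleneck_adapter(x, W_down, W_up)
assert np.allclose(y, x)            # identity mapping at initialization

# Trainable parameters per adapter vs. a full d_model x d_model layer
adapter_params = W_down.size + W_up.size
print(adapter_params, d_model * d_model)  # 65536 vs 262144
```

The bottleneck width controls the capacity/cost trade-off the abstract discusses: widening it (or stacking convolutional rather than linear projections) moves an adapter toward the mid-sized (40M-parameter) regime the paper identifies as a sweet spot.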
Citation
A. Mehta, S. Chauhan, and M. Choudhury, “Exploring Adapter Design Tradeoffs for Low Resource Music Generation,” Proceedings of the 33rd ACM International Conference on Multimedia, pp. 10554–10562, Oct. 2025, doi: 10.1145/3746027.3755766.
Source
MSMA '25: Proceedings of the 1st International Workshop on Multi-Sensorial Media and Applications
Conference
The 33rd ACM International Conference on Multimedia
Keywords
AI Music, Parameter-Efficient Fine-Tuning, Adapter-Based Learning, Non-Western Music, Hindustani Classical, Turkish Makam, Diffusion Models, Autoregressive Models
Publisher
Association for Computing Machinery
