
Face Expression and Upper Body Movement as Multimodal Control for Music Generation

Izzati, Fathinah Asma
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
This paper explores the task of generating music from multimodal visual information, namely facial expression and upper-body movement, together with text prompts, to produce expressive and temporally accurate music. We adopt parameter-efficient finetuning (PEFT) on a pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset with only 2k steps of finetuning. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align the modalities. Experiments demonstrate that integrating visual features alongside textual descriptions enhances the overall quality of the generated music in terms of musicality, creativity, beat and tempo consistency, temporal alignment with the video, and text adherence, surpassing both the proposed baselines and existing state-of-the-art video-to-music generation models. Additionally, we introduce a novel dataset of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, offering significant potential for future research in multimodal and interactive music generation. Demos are available at https://expotion2025.github.io/expotion.
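
The abstract mentions two techniques: PEFT on a pretrained text-to-music model, and temporal smoothing to align video features with music. Below is a minimal sketch of how such a pipeline could look; it is not the author's exact method. It assumes a MusicGen-style backbone, LoRA as the PEFT technique, and a moving-average smoother; the model checkpoint, target module names, and hyperparameters are illustrative placeholders.

```python
import torch
import torch.nn.functional as F
from transformers import MusicgenForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Assumed pretrained text-to-music backbone (placeholder checkpoint).
model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")

# Wrap only the decoder attention projections with low-rank adapters,
# leaving the bulk of the pretrained weights frozen (the essence of PEFT).
lora_config = LoraConfig(
    r=8,                                  # adapter rank (illustrative)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative module choice
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trains only a small fraction of weights

def smooth_and_resample(frame_feats: torch.Tensor, num_tokens: int,
                        window: int = 5) -> torch.Tensor:
    """Temporal-smoothing sketch: moving-average the per-frame visual
    features over `window` frames, then linearly interpolate to the
    music token rate so both modalities share one time axis."""
    # frame_feats: (T_video, D) -> (1, D, T_video) for 1-D pooling.
    x = frame_feats.t().unsqueeze(0)
    x = F.avg_pool1d(x, kernel_size=window, stride=1,
                     padding=window // 2, count_include_pad=False)
    x = F.interpolate(x, size=num_tokens, mode="linear", align_corners=False)
    return x.squeeze(0).t()  # (num_tokens, D), aligned to music tokens
```

The smoothed, resampled visual features could then condition the adapted model alongside the text embedding; how they are fused (e.g., cross-attention or prefix conditioning) is a design choice the sketch leaves open.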
Citation
Fathinah Asma Izzati, “Face Expression and Upper Body Movement as Multimodal Control for Music Generation,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Music Generation, Multimodal Control, Music Information Retrieval