Expotion: Facial Expression and Motion Control for Multimodal Music Generation

Li, Xinyue
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
We propose Expotion (Facial Expression and Motion Control for Multimodal Music Generation), a generative model leveraging multimodal visual controls, specifically human facial expressions and upper-body motion, as well as text prompts to produce expressive and temporally accurate music. We adopt parameter-efficient fine-tuning (PEFT) on a pretrained text-to-music generation model, enabling fine-grained adaptation to the multimodal controls using a small dataset with only 2k steps of fine-tuning. To ensure precise synchronization between video and music, we introduce a temporal smoothing strategy to align the multiple modalities. Experiments demonstrate that integrating visual features alongside textual descriptions improves the generated music in terms of general music quality, tempo consistency, temporal alignment with the video, text adherence, and video adherence, surpassing both proposed baselines and existing state-of-the-art video-to-music generation models. According to human feedback, our model also outperforms the baselines in generating music that is more creative and musical. Additionally, we introduce a novel dataset consisting of 7 hours of synchronized video recordings capturing expressive facial and upper-body gestures aligned with corresponding music, providing significant potential for future research in multimodal and interactive music generation. Demos are available at https://expotion2025.github.io/expotion
Citation
Xinyue Li, “Expotion: Facial Expression and Motion Control for Multimodal Music Generation,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Multimodal Music Generation, Background Music Generation, Parameter-Efficient Fine-tuning