EchoActor: Audio-Driven Upper Body Speech Video Generation
Author: Zhao, Jiantong
Department: Computer Vision
Embargo End Date: 2025-05-30
Type: Thesis
Date: 2025
Language: English
Abstract
Generating natural, audio-synchronized upper-body talking videos from a single reference image and an audio clip remains a challenging task. In this work, we present a novel framework, EchoActor: Audio-Driven Upper Body Speech Video Generation, which leverages a partial key-point sequence derived from an audio2gesture model, together with the audio itself, to separately control the motion and the expression of the character in the reference image. This design enables the generation of videos with large-scale upper-body movements and expressive facial dynamics. Furthermore, our architecture incorporates an innovative ID-Fusion mechanism, in which facial features extracted by a dedicated reference network are seamlessly integrated into the model, significantly improving the fidelity and clarity of the facial representation. Extensive evaluations demonstrate that our approach outperforms existing methods, delivering realistic upper-body talking video synthesis. The code is available at https://github.com/zjt000125/zjt000125.github.io.
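The abstract describes the conditioning and ID-Fusion design only at a high level; the actual implementation is in the linked repository. As a rough, hypothetical illustration (not the thesis code), the PyTorch sketch below shows one plausible way such a design could be wired: key-point and audio streams are projected into separate conditioning tokens, and reference-network identity features are injected via cross-attention. The class names (EchoActorSketch, IDFusionBlock), all dimensions, and the choice of cross-attention for identity fusion are assumptions made for this example.

    # Hypothetical sketch, not the thesis implementation.
    import torch
    import torch.nn as nn

    class IDFusionBlock(nn.Module):
        """Cross-attention that injects reference facial features into video tokens (assumed design)."""
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, video_tokens, id_tokens):
            # video_tokens: (B, N, D) latent tokens of the generated frames
            # id_tokens:    (B, M, D) facial features from the reference network
            fused, _ = self.attn(query=self.norm(video_tokens),
                                 key=id_tokens, value=id_tokens)
            return video_tokens + fused  # residual fusion

    class EchoActorSketch(nn.Module):
        """Toy conditioning stack: key-points drive body motion, audio drives expression."""
        def __init__(self, dim: int = 256, kp_dim: int = 2 * 17, audio_dim: int = 768):
            super().__init__()
            self.kp_proj = nn.Linear(kp_dim, dim)        # partial key-point sequence -> motion tokens
            self.audio_proj = nn.Linear(audio_dim, dim)  # audio features -> expression tokens
            self.id_fusion = IDFusionBlock(dim)
            self.backbone = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

        def forward(self, video_tokens, keypoints, audio_feats, id_tokens):
            cond = torch.cat([self.kp_proj(keypoints), self.audio_proj(audio_feats)], dim=1)
            x = torch.cat([video_tokens, cond], dim=1)   # simple token-concatenation conditioning
            x = self.backbone(x)
            x = x[:, : video_tokens.shape[1]]            # keep only the video tokens
            return self.id_fusion(x, id_tokens)          # fuse identity features last

    # Toy usage with random tensors (shapes chosen arbitrarily for the example)
    B, N, T = 1, 64, 32
    model = EchoActorSketch()
    out = model(torch.randn(B, N, 256), torch.randn(B, T, 34),
                torch.randn(B, T, 768), torch.randn(B, 4, 256))
    print(out.shape)  # torch.Size([1, 64, 256])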
Citation
Jiantong Zhao, “EchoActor: Audio-Driven Upper Body Speech Video Generation,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Audio-Driven Video Generation, Upper Body Animation, Talking Head Synthesis, Identity Preservation in Video Synthesis
