
LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation

Song, Wenhui
Li, Hanhui
Huang, Jiehui
Hu, Panwen
Cheng, Yuhao
Chen, Long
Yan, Yiqiang
Liang, Xiaodan
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
In this paper, we present LaVieID, a novel local autoregressive video diffusion framework designed to tackle the challenging identity-preserving text-to-video task. The key idea of LaVieID is to mitigate the loss of identity information inherent in the stochastic global generation process of diffusion transformers (DiTs) from both spatial and temporal perspectives. Specifically, unlike the global and unstructured modeling of facial latent states in existing DiTs, LaVieID introduces a local router to explicitly represent latent states by weighted combinations of fine-grained local facial structures. This alleviates undesirable feature interference and encourages DiTs to capture distinctive facial characteristics. Furthermore, a temporal autoregressive module is integrated into LaVieID to refine denoised latent tokens before video decoding. This module divides latent tokens temporally into chunks, exploiting their long-range temporal dependencies to predict biases for rectifying tokens, thereby significantly enhancing inter-frame identity consistency. Consequently, LaVieID can generate high-fidelity personalized videos and achieve state-of-the-art performance. Our code and models are available at https://github.com/ssugarwh/LaVieID.
Citation
W. Song et al., “LaVieID: Local Autoregressive Diffusion Transformers for Identity-Preserving Video Creation,” Proceedings of the 33rd ACM International Conference on Multimedia, vol. 10, pp. 9529–9538, Oct. 2025, doi: 10.1145/3746027.3754943
Source
MSMA '25: Proceedings of the 1st International Workshop on Multi-Sensorial Media and Applications
Conference
The 33rd ACM International Conference on Multimedia
Keywords
Video Synthesis, Diffusion Model, Spatio-temporal Consistency
Source
The 33rd ACM International Conference on Multimedia
Publisher
Association for Computing Machinery