StyleDiffusion: Prompt-embedding inversion for text-based editing
Li, Senmao ; van de Weijer, Joost ; Hu, Taihang ; Khan, Fahad Shahbaz ; Hou, Qibin ; Wang, Yaxing ; Yang, Jian ; Cheng, Ming-Ming
Department
Computer Vision
Type
Journal article
License
http://creativecommons.org/licenses/by/4.0/
Language
English
Abstract
A significant research effort is focused on exploiting the outstanding capacities of pretrained diffusion models for image editing. Existing approaches either fine-tune the model or invert the image in the latent space of the pretrained model. However, they suffer from two problems: (i) unsatisfactory results in selected regions and unexpected changes in non-selected regions, and (ii) the need for careful text prompt editing: the prompt should include all visual objects in the input image. To address these problems, we propose two improvements: (i) optimizing only the input of the value linear network in the cross-attention layers, which is sufficiently powerful to reconstruct a real image, and (ii) attention regularization to preserve the object-like attention maps after reconstruction and editing, enabling accurate style editing without causing significant structural change. We further improve the editing technique used for the unconditional branch of classifier-free guidance as used by P2P. Extensive experimental prompt-editing results on a variety of images demonstrate qualitatively and quantitatively that our method has editing capabilities superior to those of existing and concurrent works. Our StyleDiffusion code is available at https://github.com/sen-mao/StyleDiffusion.
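As a rough illustration of improvement (i), the sketch below shows a single cross-attention layer in which the key projection and the value projection receive separate inputs. It is a minimal NumPy sketch, not the authors' implementation: every name in it (`cross_attention`, `prompt_emb`, `learned_value_emb`, the weight matrices and all dimensions) is hypothetical, and the optimization loop that would fit the value input to a real image is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(image_feats, key_input, value_input, W_q, W_k, W_v):
    # Queries come from image features, keys from the prompt embedding.
    # Values come from a separate input (assumption: this is the only
    # quantity that would be optimized during inversion).
    q = image_feats @ W_q
    k = key_input @ W_k
    v = value_input @ W_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v, attn

# Hypothetical sizes: 16 spatial positions, 5 prompt tokens.
rng = np.random.default_rng(0)
n_pix, n_tok, d_img, d_txt, d = 16, 5, 8, 12, 4
feats = rng.normal(size=(n_pix, d_img))
prompt_emb = rng.normal(size=(n_tok, d_txt))
W_q = rng.normal(size=(d_img, d))
W_k = rng.normal(size=(d_txt, d))
W_v = rng.normal(size=(d_txt, d))

# Stand-in for an optimized value input (here just a perturbed copy).
learned_value_emb = prompt_emb + 0.5 * rng.normal(size=prompt_emb.shape)

out_a, attn_a = cross_attention(feats, prompt_emb, prompt_emb, W_q, W_k, W_v)
out_b, attn_b = cross_attention(feats, prompt_emb, learned_value_emb, W_q, W_k, W_v)
```

In this sketch, changing only the value input leaves the attention maps (`attn_a` and `attn_b`) identical while the layer output changes, which is one way to read the abstract's claim that structure can be preserved while appearance is edited.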
Citation
S. Li, J. van de Weijer, T. Hu, F.S. Khan, Q. Hou, Y. Wang, et al., "StyleDiffusion: Prompt-embedding inversion for text-based editing," Computational Visual Media, pp. 1-21, 2026, https://doi.org/10.26599/cvm.2025.9450462.
Source
Computational Visual Media
Keywords
46 Information and Computing Sciences, 4603 Computer Vision and Multimedia Computation, 4607 Graphics, Augmented Reality and Games, 4611 Machine Learning
Publisher
Tsinghua University Press
