
An Effective and Finetuning-free Pipeline for Face Swapping with Diffusion models

Nguyen, Thao Do Chi
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Collections
Research Projects
Abstract
In current face-swapping work, many methods rely on generative adversarial network (GAN) frameworks; although they achieve positive results, the output images still suffer from drawbacks, often exhibiting noisy artifacts, particularly in scenarios involving diverse pose variations, lighting conditions, and face occlusion. On the other hand, recent work has turned to pretrained diffusion models for their exceptional generative performance. Nevertheless, training these models is not trivial: it requires intensive resources, and the results have yet to demonstrate satisfactory efficacy. That is, training a Text-to-Image (T2I) model requires a massive dataset and high computational resources, while finetuning such models can lead to catastrophic forgetting when the number of training epochs or the volume of data is insufficient. To overcome these disadvantages, I introduce an efficient, finetuning-free pipeline for accurate and photorealistic facial editing that leverages pretrained T2I models. Given a source image and a target image, the goal is to transfer the identity attributes of the source into the target representation while maintaining high fidelity and preserving ID-unrelated attributes. I assume that the face-swapping output should preserve the structural attributes of the target image and the identity of the source image. Based on these insights, my contributions are:
1. I redefine the face-swapping task as an inpainting task, preserving the identity from the source image while projecting it into the target image's feature space during the denoising process.
2. I integrate an Identity Adapter that extracts facial features and transfers these ID embeddings to the diffusion model through cross-attention layers.
3. I implement a Face Structure Module that controls spatial conditions through MultiControlNet models.
Through qualitative and quantitative comparisons on the FFHQ and CelebAMask-HQ datasets, I evaluate the efficacy of my approach in retaining identity, preserving structure and expression, and achieving high-fidelity, realistic face swapping. My method performs competitively against fully fine-tuned face-swapping models, which require extensive fine-tuning over tens of days.
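The Identity Adapter described in the abstract injects ID embeddings into the diffusion model through cross-attention layers. The mechanism can be illustrated with a minimal NumPy sketch; the toy dimensions, random weights, and function names below are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(latent_tokens, id_embeddings, W_q, W_k, W_v):
    """Condition diffusion latents on identity features via cross-attention.

    latent_tokens: (N, d) spatial tokens from the denoising U-Net
    id_embeddings: (M, d) identity embeddings from a face encoder
    """
    Q = latent_tokens @ W_q            # queries come from the image latents
    K = id_embeddings @ W_k            # keys come from the ID embeddings
    V = id_embeddings @ W_v            # values come from the ID embeddings
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = softmax(scores, axis=-1)    # each latent token attends to ID tokens
    return attn @ V                    # identity-conditioned latent update

rng = np.random.default_rng(0)
d = 8
latents = rng.standard_normal((16, d))    # 16 spatial tokens (toy size)
id_emb = rng.standard_normal((4, d))      # 4 identity tokens (toy size)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention(latents, id_emb, W_q, W_k, W_v)
print(out.shape)  # (16, 8)
```

In the full pipeline, the output of such a layer would be added back into the U-Net's hidden states at each cross-attention block, so the denoising trajectory is steered toward the source identity without any model finetuning.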
Citation
Thao Do Chi Nguyen, “An Effective and Finetuning-free Pipeline for Face Swapping with Diffusion models,” Master of Science thesis, Computer Vision, MBZUAI, 2025.
Keywords
Diffusion models, Generative models, Face-swapping