Monocular-to-3D Virtual Try-On with Generative Semantic Articulated Fields
Xie, Zhenyu; Zhao, Fuwei; Zheng, Jun; Dong, Xin; Zhu, Feida; Liang, Xiaodan
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
We introduce a monocular-to-3D virtual try-on network based on a conditional 3D-aware Generative Adversarial Network (3D-GAN) for synthesizing multi-view try-on results from a single monocular image. In contrast to previous 3D virtual try-on methods that rely on costly scanned meshes or pseudo-depth maps for supervision, our approach utilizes a conditional 3D-GAN trained solely on 2D images, greatly simplifying dataset construction and enhancing model scalability. Specifically, we propose a Generative monocular-to-3D Virtual Try-ON network (G3D-VTON) that integrates a 3D-aware conditional Parsing Module (3DPM), a U-Net Refinement Module (URM), and a Flow-based 2D Virtual Try-On Module (FTM). In our framework, the 3DPM generates a 3D representation of the virtual try-on result, thereby enabling multi-view rendering. To accomplish this, it is implemented with conditional generative semantic articulated fields, which leverage the 3D SMPL prior via inverse skinning to learn the Signed Distance Function (SDF) of the try-on result in a canonical pose space. The learned SDF enables the rendering of both a coarse human parsing map and a preliminary try-on output with explicit camera control. Furthermore, within the 3DPM, we introduce deferred pose guidance to decouple style and pose conditions during training, thereby facilitating view-controllable generation during inference. However, the rendered human parsing maps and try-on results exhibit imprecise shapes and blurry textures. To address these issues, the URM refines the rendered outputs with a refinement U-Net, and the FTM integrates the refined results with the 2D warped garment to produce the final try-on output with more accurate and realistic appearance details. Extensive experiments demonstrate that G3D-VTON effectively manipulates and generates faithful 3D human appearances wearing the desired garment, outperforming both 3D-GAN-based and depth-based 3D approaches while delivering superior visual results in 2D.
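The core mechanism the abstract describes, inverse skinning against the SMPL prior to query a conditional SDF in canonical pose space, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the CanonicalSDF network, the inverse_lbs helper, and the garment conditioning code are hypothetical, and per-point skinning weights are assumed to be available (e.g., copied from the nearest SMPL vertex).

```python
# Minimal sketch (NOT the authors' code) of the inverse-skinning idea behind
# the 3DPM: posed-space query points are warped back into SMPL's canonical
# pose via inverse linear blend skinning, where a conditional SDF network is
# evaluated. All names below are hypothetical.
import torch
import torch.nn as nn

class CanonicalSDF(nn.Module):
    """Conditional SDF in canonical pose space: (point, style code) -> distance."""
    def __init__(self, cond_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, hidden), nn.Softplus(beta=100),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_canonical, cond):
        # x_canonical: (N, 3) canonical-space points, cond: (N, cond_dim)
        return self.mlp(torch.cat([x_canonical, cond], dim=-1))

def inverse_lbs(x_posed, bone_transforms, skinning_weights):
    """Warp posed-space points to canonical space by inverting blended
    bone transforms (inverse linear blend skinning).

    x_posed:          (N, 3) query points in posed space
    bone_transforms:  (J, 4, 4) canonical-to-posed transforms from SMPL
    skinning_weights: (N, J) per-point blend weights
    """
    # Blend per-bone transforms per point, then invert the blended transform.
    T = torch.einsum('nj,jab->nab', skinning_weights, bone_transforms)  # (N, 4, 4)
    x_h = torch.cat([x_posed, torch.ones_like(x_posed[:, :1])], dim=-1)  # homogeneous
    x_canonical = torch.linalg.solve(T, x_h.unsqueeze(-1)).squeeze(-1)[:, :3]
    return x_canonical

# Usage: query the try-on SDF for posed-space samples along camera rays.
sdf_net = CanonicalSDF(cond_dim=128)
x = torch.randn(1024, 3)                       # posed-space sample points
bones = torch.eye(4).repeat(24, 1, 1)          # 24 SMPL joints (identity placeholder)
w = torch.softmax(torch.randn(1024, 24), -1)   # placeholder skinning weights
garment_code = torch.randn(1024, 128)          # style/garment condition per point
sdf_values = sdf_net(inverse_lbs(x, bones, w), garment_code)  # (1024, 1)
```

In this sketch the garment code stands in for the style condition that, per the abstract, deferred pose guidance keeps decoupled from pose during training; camera control enters through where the posed-space samples are drawn, not through the SDF itself.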
Citation
Z. Xie, F. Zhao, J. Zheng, X. Dong, F. Zhu and X. Liang, "Monocular-to-3D Virtual Try-On with Generative Semantic Articulated Fields," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3591072
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
3D Virtual Try-on, Generative Articulated Fields
Publisher
IEEE
