
SnapGen: Taming High-Resolution Text-To-Image Models for Mobile Devices with Efficient Architectures and Training

Chen, Jierun
Hu, Dongting
Huang, Xijie
Coskun, Huseyin
Sahni, Arpit
Gupta, Aarush
Goyal, Anujraaj
Lahiri, Dishani
Singh, Rajesh
Idelbayev, Yerlan
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Existing text-to-image (T2I) diffusion models face several limitations, including large model sizes, slow runtime, and low-quality generation on mobile devices. This paper aims to address all of these challenges by developing an extremely small and fast T2I model that generates high-resolution and high-quality images on mobile platforms. We propose several techniques to achieve this goal. First, we systematically examine the design choices of the network architecture to reduce model parameters and latency, while ensuring high-quality generation. Second, to further improve generation quality, we employ cross-architecture knowledge distillation from a much larger model, using a multi-level approach to guide the training of our model from scratch. Third, we enable few-step generation by integrating adversarial guidance with knowledge distillation. For the first time, our model, SnapGen, demonstrates the generation of 1024² px images on a mobile device in around 1.4 seconds. On ImageNet-1K, our model, with only 372M parameters, achieves an FID of 2.06 for 256² px generation. On T2I benchmarks (i.e., GenEval and DPG-Bench), our model, with merely 379M parameters, surpasses large-scale models with billions of parameters at a significantly smaller size (e.g., 7× smaller than SDXL, 14× smaller than IF-XL).
Citation
J. Chen et al., "SnapGen: Taming High-Resolution Text-To-Image Models for Mobile Devices with Efficient Architectures and Training," 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025, pp. 7997-8008, doi: 10.1109/CVPR52734.2025.00749.
Source
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Conference
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Keywords
Efficient Architecture, Generative Models, Mobile Text-to-image Models
Publisher
IEEE