Despite their ability to generate high-resolution and diverse images from text prompts, text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However, previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training, either from real data or synthetically generated by the teacher model. In response to this limitation, we present a novel image-free distillation scheme named SwiftBrush. We draw inspiration from text-to-3D synthesis, where a 3D neural radiance field that aligns with an input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss, without any ground-truth 3D data. Our approach re-purposes that same loss to distill a pretrained multi-step text-to-image model into a student network that generates high-fidelity images in just a single inference step. Despite its simplicity, our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without relying on any training image data. Remarkably, SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark, matching or even substantially surpassing existing state-of-the-art distillation techniques.
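As a loose illustration of the idea in the abstract (not the paper's actual implementation), the variational-score-distillation-style update can be sketched with toy linear stand-ins: a frozen "teacher" noise predictor, a second noise predictor tracking the student's distribution, and a one-step generator whose parameter is pushed by the difference of the two predictions. All function names and constants below are illustrative assumptions.

```python
import numpy as np

# Toy 1-D sketch of a score-distillation-style update (illustrative only).
# In SwiftBrush, the teacher is a pretrained Stable Diffusion model and the
# student is a one-step generator; here both are tiny linear maps so the
# shape of the update rule itself is visible.

rng = np.random.default_rng(0)

def teacher_eps(x_t, t):
    """Stand-in for the frozen teacher's noise prediction."""
    return 0.9 * x_t

def student_track_eps(x_t, t):
    """Stand-in for the auxiliary predictor that tracks the student."""
    return 0.5 * x_t

theta = np.array([2.0])  # parameter of the one-step generator G_theta
lr = 0.1

for step in range(100):
    z = rng.standard_normal(1)    # input noise for the one-step student
    x = theta * z                 # student sample x = G_theta(z)
    t = rng.uniform(0.02, 0.98)   # random diffusion time
    eps = rng.standard_normal(1)
    x_t = x + t * eps             # crude forward-noising (illustrative)
    # VSD-style gradient: (eps_teacher - eps_track) * dx/dtheta
    grad = (teacher_eps(x_t, t) - student_track_eps(x_t, t)) * z
    theta -= lr * grad
```

Because the teacher's prediction here points toward a data mode at zero, repeated updates contract `theta`; in the real method this same difference-of-scores signal steers the one-step student toward the teacher's image distribution, with no image data involved.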
| Model | FID ↓ | CLIP ↑ |
|---|---|---|
| *Image-dependent Distillation* | | |
| Guided Distillation (Meng et al., 2023) | 37.3 | 0.27 |
| LCM (Luo et al., 2023) | 35.56 | 0.24 |
| InstaFlow (Liu et al., 2023) | 13.27 | 0.28 |
| *Image-free Distillation* | | |
| BOOT (Gu et al., 2023) | 17.89 | 35.49 |
| SwiftBrush (Our Work) | 16.67 | 0.29 |
| *Stable Diffusion v2.1* | | |
| 1 sampling step | 202.14 | 0.06 |
| 25 sampling steps | 13.45 | 0.23 |
We thank Uy Dieu Tran for early discussions and for many helpful comments and suggestions throughout the project. Special thanks to Trung Tuan Dao for valuable feedback and support. Last but not least, we thank Zhengyi Wang, Cheng Lu, Yikai Wang, Fan Bao, Chongxuan Li, Hang Su, and Jun Zhu for their work on ProlificDreamer, as well as the Hugging Face team for the diffusers framework.
@InProceedings{nguyen2024swiftbrush,
title={SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation},
author={Thuan Hoang Nguyen and Anh Tran},
booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024},
}