Flow Matching in Latent Space

¹VinAI Research, ²National University of Singapore
*Equal contribution

Input data is first encoded to produce the latent vector \( z_0 \), which is used to train a velocity estimator of the transformation from a standard normal distribution \( p(z_1) = \mathcal{N}(0, \mathbf{I}) \) to the target latent distribution \( p(z_0) \).

Sampling starts from random noise \( z_1 \); the trained network predicts the velocity toward the target latent distribution \( p(z_0) \), and the flow is followed via numerical integration. Finally, \( z_0 \) is decoded to generate the corresponding image.


Class-conditional image generation on ImageNet

Abstract

Flow matching is a recent framework for training generative models that exhibits impressive empirical performance while being relatively easy to train compared with diffusion-based models. Despite its advantageous properties, prior methods still face the challenges of expensive computation and the large number of function evaluations required by off-the-shelf solvers in the pixel space. Furthermore, although latent-based generative methods have shown great success in recent years, flow matching remains underexplored in this setting. In this work, we propose to apply flow matching in the latent spaces of pretrained autoencoders, which offers improved computational efficiency and scalability for high-resolution image synthesis. This enables flow-matching training on constrained computational resources while maintaining quality and flexibility. Additionally, our work stands as a pioneering contribution to the integration of various conditions into flow matching for conditional generation tasks, including label-conditioned image generation, image inpainting, and semantic-to-image generation. Through extensive experiments, our approach demonstrates its effectiveness in both quantitative and qualitative results on various datasets, such as CelebA-HQ, FFHQ, LSUN Church & Bedroom, and ImageNet. We also provide a theoretical control of the Wasserstein-2 distance between the reconstructed latent flow distribution and the true data distribution, showing that it is upper-bounded by the latent flow matching objective.

Motivation

Flow Matching (FM) has revolutionized the training of continuous normalizing flows (CNFs) by introducing a simulation-free approach. Recent works, grounded in optimal transport theory, simplify the training objective by assuming a constant velocity field between data and noise. This simplification yields straighter probability paths than the high-curvature paths of diffusion models, making FM an efficient approach for training CNFs.
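Concretely, with linear interpolants \( z_t = (1 - t)\, z_0 + t\, z_1 \), this objective regresses a network \( v_\theta \) onto the constant velocity of the path:

\[
\mathcal{L}_{\mathrm{FM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; z_0 \sim p(z_0),\; z_1 \sim \mathcal{N}(0, \mathbf{I})} \left\| v_\theta(z_t, t) - (z_1 - z_0) \right\|^2 .
\]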

However, FM is still in its early stages and not yet ready for high-resolution image synthesis due to its costly ODE sampling process. Our approach represents a pioneering effort to thoroughly integrate and study latent representations for flow-matching models, with the aim of enhancing both scalability and performance.

Besides, the application of flow-based models to class-conditional generation remains unexplored. Hence, we introduce a classifier-free velocity field, inspired by the concept of classifier-free guidance in diffusion models. Despite employing the same technique, our approach distinguishes itself by operating on a velocity field instead of noise, leading to a distinctive method for class-conditional generation. Additionally, our method supports different types of conditions, enabling tasks such as image inpainting and mask-to-image generation.
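Analogously to classifier-free guidance, the guided velocity mixes conditional and unconditional predictions with a guidance scale \( w \) (notation ours):

\[
\hat{v}_\theta(z_t, t, c) = v_\theta(z_t, t, \varnothing) + w \left( v_\theta(z_t, t, c) - v_\theta(z_t, t, \varnothing) \right),
\]

where \( \varnothing \) denotes the null (unconditional) label.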

Why opt for the latent space?

Efficient computing: The most compelling advantage of the latent space is its compact representation, whose smaller spatial dimensions greatly benefit high-resolution synthesis by reducing the computational cost of every numerical-solver evaluation. For instance, a typical downsampling autoencoder maps a 256×256×3 image to a 32×32×4 latent, a 48× reduction in dimensionality per solver step.

Expressivity: As found in prior latent-based diffusion models, training a generative model on latent variables typically leads to improved expressivity, thereby enhancing model performance. Building on these findings, we expect our model, through the combination of FM and a carefully selected latent framework, to exhibit heightened expressivity.

As a result, the latent space empowers efficient training and sampling while enabling the generation of high-quality outputs.

Method

Unconditional model

The training and sampling procedures of the unconditional model are described as follows. During training, the model is trained to estimate the velocity of the transformation from the current state \( z_t \) to the data distribution, supervised by the direct flow between data and noise, \( z_1 - z_0 \).
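To make the procedure concrete, below is a minimal PyTorch sketch. It assumes a pretrained autoencoder `vae` with `encode`/`decode` methods and a velocity network `v_theta(z, t)` over \( (B, C, H, W) \) latents; all names and signatures are illustrative, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def fm_training_step(v_theta, vae, x, optimizer):
    """One latent flow-matching step (hypothetical names: v_theta, vae)."""
    with torch.no_grad():
        z0 = vae.encode(x)                  # data -> latent z_0
    z1 = torch.randn_like(z0)               # noise z_1 ~ N(0, I)
    t = torch.rand(z0.size(0), device=z0.device).view(-1, 1, 1, 1)
    zt = (1 - t) * z0 + t * z1               # linear probability path
    target = z1 - z0                         # constant velocity d z_t / dt
    loss = F.mse_loss(v_theta(zt, t.flatten()), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

@torch.no_grad()
def fm_sample(v_theta, vae, shape, steps=100, device="cuda"):
    """Euler integration from noise (t = 1) down to data (t = 0), then decode."""
    z = torch.randn(shape, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((shape[0],), 1.0 - i * dt, device=device)
        z = z - dt * v_theta(z, t)           # step toward t = 0
    return vae.decode(z)
```

In practice the fixed-step Euler loop can be swapped for an off-the-shelf adaptive ODE solver; the compact latent resolution is what keeps each evaluation cheap.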

Class-conditional model

Unlike the unconditional model, the class-conditional model requires an extra class label during training. With probability \( p_{u} \) (e.g., 0.1), we also train the unconditional model by dropping the label, which preserves generation diversity while alleviating overfitting.
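Continuing the sketch above (same hypothetical `vae`, with a now label-aware `v_theta(z, t, y)`), label dropout and the guided velocity could look as follows; `NULL_CLASS` is an assumed reserved index for the unconditional token.

```python
import torch
import torch.nn.functional as F

NULL_CLASS = 1000  # hypothetical index reserved for the "no label" token

def conditional_fm_step(v_theta, vae, x, y, optimizer, p_u=0.1):
    """Class-conditional flow-matching step with label dropout (names illustrative)."""
    with torch.no_grad():
        z0 = vae.encode(x)
    z1 = torch.randn_like(z0)
    t = torch.rand(z0.size(0), device=z0.device).view(-1, 1, 1, 1)
    zt = (1 - t) * z0 + t * z1
    # With probability p_u, replace the label by the null token (unconditional branch).
    drop = torch.rand(y.shape, device=y.device) < p_u
    y_in = torch.where(drop, torch.full_like(y, NULL_CLASS), y)
    loss = F.mse_loss(v_theta(zt, t.flatten(), y_in), z1 - z0)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def guided_velocity(v_theta, z, t, y, w=1.5):
    """Classifier-free velocity: mix conditional and unconditional predictions."""
    v_uncond = v_theta(z, t, torch.full_like(y, NULL_CLASS))
    v_cond = v_theta(z, t, y)
    return v_uncond + w * (v_cond - v_uncond)
```

At sampling time, the guided velocity simply replaces `v_theta` inside the Euler loop of the unconditional sketch.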

Downstream tasks

Inpainting
Mask-to-image

The first row shows the input images, the second row shows our generated images, and the last row shows the reference images.
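One plausible way to supply such spatial conditions is sketched below, under the assumption that the condition is concatenated to the velocity network's input channels (as commonly done in latent-space generative models); the function name and shapes are ours.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inpaint_condition(vae, x, mask):
    """Build a spatial condition for inpainting (illustrative sketch).

    x:    (B, 3, H, W) input images
    mask: (B, 1, H, W) float mask, 1 where pixels are missing
    """
    z_masked = vae.encode(x * (1 - mask))               # latent of the visible region
    m = F.interpolate(mask, size=z_masked.shape[-2:])   # mask at latent resolution
    return torch.cat([z_masked, m], dim=1)              # extra input channels for v_theta
```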

Theoretical analysis: Bounding estimation error

We have shown that minimizing the FM objective in latent space controls the Wasserstein-2 distance between the target density \( p_0 \) and the reconstructed density \( \hat{p}_0 \), the distance underlying the Fréchet inception distance (FID), a common metric for image generation. This means that our latent flow matching is guaranteed to control this metric, given a reasonable estimate of \( \hat{v}(\mathbf{z}_t, t) \). Nonetheless, the analysis also suggests that the quality of latent flow matching depends on constants determined by the expressivity of the encoder and decoder, which has been observed in prior research on generative modeling in latent space.
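Schematically, bounds of this type take the following shape (an illustrative form under Lipschitz assumptions, not the paper's verbatim theorem; see the paper for the exact statement and constants):

\[
W_2\!\left(p_0, \hat{p}_0\right) \;\lesssim\; L_{\mathrm{dec}} \, e^{\int_0^1 \mathrm{Lip}(v_t)\, dt} \left( \int_0^1 \mathbb{E} \left\| \hat{v}(\mathbf{z}_t, t) - v(\mathbf{z}_t, t) \right\|^2 dt \right)^{1/2} + \varepsilon_{\mathrm{rec}},
\]

where \( L_{\mathrm{dec}} \) is a Lipschitz constant of the decoder, \( \mathrm{Lip}(v_t) \) that of the true velocity field, and \( \varepsilon_{\mathrm{rec}} \) accounts for the autoencoder's reconstruction error. The first factor makes explicit why decoder expressivity enters the guarantee.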

BibTeX

@article{dao2023lfm,
  author  = {Quan Dao and Hao Phung and Binh Nguyen and Anh Tran},
  title   = {Flow Matching in Latent Space},
  journal = {arXiv preprint arXiv:2307.08698},
  year    = {2023},
}