DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability

08/18/2023
by   Runhui Huang, et al.
0

Recently, large-scale diffusion models, e.g., Stable diffusion and DallE2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., 1.65 improvement on average accuracy of zero-shot classification over 12 datasets and 2.42 improvement on FID of zero-shot image synthesis.

READ FULL TEXT

page 1

page 8

page 12

page 13

research
02/14/2022

Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

This paper presents a large-scale Chinese cross-modal dataset for benchm...
research
06/01/2023

UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning

Recent advances in vision-language pre-training have enabled machines to...
research
03/28/2023

Your Diffusion Model is Secretly a Zero-Shot Classifier

The recent wave of large-scale text-to-image diffusion models has dramat...
research
07/11/2023

Diffusion idea exploration for art generation

Cross-Modal learning tasks have picked up pace in recent times. With ple...
research
08/16/2022

Your ViT is Secretly a Hybrid Discriminative-Generative Diffusion Model

Diffusion Denoising Probability Models (DDPM) and Vision Transformer (Vi...
research
06/16/2023

Energy-Based Cross Attention for Bayesian Context Update in Text-to-Image Diffusion Models

Despite the remarkable performance of text-to-image diffusion models in ...
research
05/26/2021

CogView: Mastering Text-to-Image Generation via Transformers

Text-to-Image generation in the general domain has long been an open pro...

Please sign up or login with your details

Forgot password? Click here to reset