Controllable Image Generation via Collage Representations

by   Arantxa Casanova, et al.

Recent advances in conditional generative image models have enabled impressive results. On the one hand, text-based conditional models have achieved remarkable generation quality, by leveraging large-scale datasets of image-text pairs. To enable fine-grained controllability, however, text-based models require long prompts, whose details may be ignored by the model. On the other hand, layout-based conditional models have also witnessed significant advances. These models rely on bounding boxes or segmentation maps for precise spatial conditioning in combination with coarse semantic labels. The semantic labels, however, cannot be used to express detailed appearance characteristics. In this paper, we approach fine-grained scene controllability through image collages which allow a rich visual description of the desired scene as well as the appearance and location of the objects therein, without the need of class nor attribute labels. We introduce "mixing and matching scenes" (M Ms), an approach that consists of an adversarially trained generative image model which is conditioned on appearance features and spatial positions of the different elements in a collage, and integrates these into a coherent image. We train our model on the OpenImages (OI) dataset and evaluate it on collages derived from OI and MS-COCO datasets. Our experiments on the OI dataset show that M Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity. On the MS-COCO dataset, we highlight the generalization ability of our model by outperforming DALL-E in terms of the zero-shot FID metric, despite using two magnitudes fewer parameters and data. Collage based generative models have the potential to advance content creation in an efficient and effective way as they are intuitive to use and yield high quality generations.


page 2

page 21

page 22

page 26

page 27

page 28

page 29

page 31


Zero-shot spatial layout conditioning for text-to-image diffusion models

Large-scale text-to-image diffusion models have significantly improved t...

Generating Multiple Objects at Spatially Distinct Locations

Recent improvements to Generative Adversarial Networks (GANs) have made ...

ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts

Recent progress in diffusion models has revolutionized the popular techn...

X Fuse: Fusing Visual Information in Text-to-Image Generation

We introduce X Fuse, a general approach for conditioning on visual inf...

Boosting Text-to-Image Diffusion Models with Fine-Grained Semantic Rewards

Recent advances in text-to-image diffusion models have achieved remarkab...

Controllable and Progressive Image Extrapolation

Image extrapolation aims at expanding the narrow field of view of a give...

On the Evaluation of Conditional GANs

Conditional Generative Adversarial Networks (cGANs) are finding increasi...

Please sign up or login with your details

Forgot password? Click here to reset