Progressive Text-to-Image Diffusion with Soft Latent Direction

by Yuteng Ye, et al.

Despite rapid progress in text-to-image generation, synthesizing and manipulating multiple entities under specific relational constraints remains challenging. This paper introduces a progressive synthesis and editing operation that incorporates entities into the target image step by step, ensuring that spatial and relational constraints are satisfied at each step. Our key insight is that a pre-trained text-to-image diffusion model handles one or two entities well but often falters when dealing with more. To address this limitation, we use a Large Language Model (LLM) to decompose long, intricate text descriptions into coherent directives that follow a strict format. To execute directives involving three distinct semantic operations (insertion, editing, and erasing), we formulate the Stimulus, Response, and Fusion (SRF) framework. Within this framework, latent regions are gently stimulated in alignment with each operation, and the responsive latent components are then fused to achieve cohesive entity manipulation. The proposed framework yields notable improvements in object synthesis, particularly for intricate and lengthy textual inputs, and thereby establishes a new benchmark for text-to-image generation tasks.
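The decomposition into strictly formatted directives, applied one at a time, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `op | entity | detail` line format, the `Directive` and `Op` types, and the functions `parse_directive` and `apply_progressively` are hypothetical and are not taken from the paper. The "scene" is a plain dictionary standing in for the image, so the sketch shows only the progressive control flow over insertion, editing, and erasing, not the SRF latent operations themselves.

```python
from dataclasses import dataclass
from enum import Enum


class Op(Enum):
    """The three semantic operations named in the abstract."""
    INSERT = "insert"
    EDIT = "edit"
    ERASE = "erase"


@dataclass(frozen=True)
class Directive:
    op: Op
    entity: str
    detail: str = ""  # relation for an insert, new description for an edit


def parse_directive(line: str) -> Directive:
    """Parse one 'op | entity | detail' line (hypothetical format)."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError(f"malformed directive: {line!r}")
    op = Op(parts[0].lower())  # raises ValueError for an unknown operation
    detail = parts[2] if len(parts) == 3 else ""
    return Directive(op, parts[1], detail)


def apply_progressively(directives):
    """Apply directives in order to a symbolic scene (entity -> description).

    The dictionary is a stand-in for the image; a real system would run
    one diffusion-based synthesis or editing step per directive instead.
    """
    scene = {}
    for d in directives:
        if d.op is Op.INSERT:
            scene[d.entity] = d.detail
        elif d.op is Op.EDIT:
            if d.entity not in scene:
                raise KeyError(f"cannot edit missing entity: {d.entity}")
            scene[d.entity] = d.detail
        else:  # Op.ERASE
            scene.pop(d.entity, None)
    return scene


lines = [
    "insert | cat | on the sofa",
    "insert | lamp | left of the sofa",
    "edit | cat | black cat on the sofa",
    "erase | lamp",
]
scene = apply_progressively([parse_directive(line) for line in lines])
print(scene)
```

Keeping each directive to a single entity and a single operation mirrors the observation above that the underlying model is most reliable when it handles only one or two entities per step.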



Zero-shot Generation of Coherent Storybook from Plain Text Story using Diffusion Models

Recent advancements in large scale text-to-image models have opened new ...

Bridging CLIP and StyleGAN through Latent Alignment for Image Editing

Text-driven image manipulation is developed since the vision-language mo...

Towards Open-World Text-Guided Face Image Generation and Manipulation

The existing text-guided image synthesis methods can only produce limite...

MDP: A Generalized Framework for Text-Guided Image Editing by Manipulating the Diffusion Path

Image generation using diffusion can be controlled in multiple ways. In ...

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

The recent popularity of text-to-image diffusion models (DM) can largely...

ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation

Existing text-guided image manipulation methods aim to modify the appear...
