SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

by Shanshan Zhong et al.

Diffusion models, which have emerged as popular text-to-image generation models, can produce high-quality, content-rich images guided by textual prompts. However, existing models show limited semantic understanding and commonsense reasoning when the input prompts are concise narratives, resulting in low-quality image generation. To improve their capability on narrative prompts, we propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models. To this end, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples; each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. We then align the semantic representations of narrative prompts with those of the complex prompts and transfer knowledge from large language models (LLMs) to the SUR-adapter via knowledge distillation, so that it acquires powerful semantic understanding and reasoning capabilities and builds high-quality textual semantic representations for text-to-image generation. We conduct experiments integrating multiple LLMs and popular pre-trained diffusion models to show the effectiveness of our approach in enabling diffusion models to understand and reason over concise natural language without degrading image quality. By bridging the semantic gap between simple narrative prompts and complex keyword-based prompts, our approach makes text-to-image diffusion models easier to use and improves the user experience, demonstrating its potential for advancing user-friendly text-to-image generation. The code is released at
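The abstract describes a training objective with two ingredients: aligning the adapter's representation of a narrative prompt with the representation of the corresponding complex keyword-based prompt, and distilling knowledge from an LLM. The abstract does not give the exact loss, so the sketch below is a minimal toy illustration under the assumption that both terms are mean-squared-error penalties between fixed-size embedding vectors; the function name, weights `alpha`/`beta`, and the use of MSE are illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    """Mean squared error between two embedding vectors."""
    return float(np.mean((a - b) ** 2))

def sur_adapter_loss(narrative_emb: np.ndarray,
                     complex_emb: np.ndarray,
                     llm_emb: np.ndarray,
                     alpha: float = 1.0,
                     beta: float = 1.0) -> float:
    """Toy two-term objective (hypothetical form):
    - alignment term: pull the narrative-prompt embedding toward the
      complex keyword-prompt embedding (semantic alignment on SURD pairs);
    - distillation term: pull it toward the LLM's representation
      (knowledge distillation from the large language model)."""
    align = mse(narrative_emb, complex_emb)   # alignment to complex prompt
    distill = mse(narrative_emb, llm_emb)     # distillation from the LLM
    return alpha * align + beta * distill

# Toy embeddings standing in for encoder/LLM outputs.
rng = np.random.default_rng(0)
narrative = rng.standard_normal(8)
complex_kw = rng.standard_normal(8)
llm_repr = rng.standard_normal(8)
loss = sur_adapter_loss(narrative, complex_kw, llm_repr)
```

In actual training both terms would be computed over batches of SURD samples and backpropagated into the adapter's parameters only, keeping the pre-trained diffusion model frozen, which is what makes the approach parameter-efficient.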


X-IQE: eXplainable Image Quality Evaluation for Text-to-Image Generation with Visual Large Language Models

This paper introduces a novel explainable image quality evaluation appro...

Judge, Localize, and Edit: Ensuring Visual Commonsense Morality for Text-to-Image Generation

Text-to-image generation methods produce high-resolution and high-qualit...

GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation

Text-to-image (T2I) models based on diffusion processes have achieved re...

Painter: Teaching Auto-regressive Language Models to Draw Sketches

Large language models (LLMs) have made tremendous progress in natural la...

MultiFusion: Fusing Pre-Trained Models for Multi-Lingual, Multi-Modal Image Generation

The recent popularity of text-to-image diffusion models (DM) can largely...

Language-Oriented Communication with Semantic Coding and Knowledge Distillation for Text-to-Image Generation

By integrating recent advances in large language models (LLMs) and gener...

Creative Painting with Latent Diffusion Models

Artistic painting has achieved significant progress during recent years....
