Visual Programming for Text-to-Image Generation and Evaluation

05/24/2023
by Jaemin Cho, et al.

As large language models have demonstrated impressive performance in many domains, recent works have adopted language models (LMs) as controllers of visual modules for vision-and-language tasks. While existing work focuses on equipping LMs with visual understanding, we propose two novel interpretable/explainable visual programming frameworks for text-to-image (T2I) generation and evaluation. First, we introduce VPGen, an interpretable step-by-step T2I generation framework that decomposes T2I generation into three steps: object/count generation, layout generation, and image generation. We employ an LM to handle the first two steps (object/count generation and layout generation) by finetuning it on text-layout pairs. Our step-by-step T2I generation framework provides stronger spatial control than end-to-end models, the dominant approach for this task. Furthermore, we leverage the world knowledge of pretrained LMs, overcoming the limitation of previous layout-guided T2I works that can only handle predefined object classes. We demonstrate that VPGen provides improved control over object counts, spatial relations, and scales compared to state-of-the-art T2I generation models. Second, we introduce VPEval, an interpretable and explainable evaluation framework for T2I generation based on visual programming. Unlike previous T2I evaluations that rely on a single scoring model that is accurate in some skills but unreliable in others, VPEval produces evaluation programs that invoke a set of visual modules that are experts in different skills, and also provides visual and textual explanations of the evaluation results. Our analysis shows that VPEval correlates better with human judgments on skill-specific and open-ended prompts than widely used single-model-based evaluation. We hope our work encourages future progress on interpretable/explainable generation and evaluation for T2I models. Website: https://vp-t2i.github.io
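The two pipelines in the abstract can be sketched in a few lines. Everything below is a hypothetical illustration: all function names, the placeholder layout logic, and the toy evaluation program are assumptions for exposition, not the paper's actual implementation (which finetunes an LM for steps 1-2, uses a layout-guided image generator for step 3, and dispatches VPEval programs to pretrained vision modules).

```python
# Hypothetical sketch of VPGen's three-step decomposition and a VPEval-style
# evaluation program. Function names and logic are illustrative placeholders.

def generate_objects_and_counts(prompt):
    """Step 1 (hypothetical): an LM lists objects to draw and their counts.
    A real system queries a finetuned LM; we return a fixed example."""
    return [("dog", 2), ("ball", 1)]

def generate_layout(objects):
    """Step 2 (hypothetical): assign each object instance a bounding box
    (x, y, w, h) in normalized [0, 1] coordinates. A real LM predicts these;
    here we place evenly spaced placeholder boxes."""
    layout = []
    for name, count in objects:
        for _ in range(count):
            layout.append((name, (0.1 + 0.3 * len(layout), 0.4, 0.25, 0.3)))
    return layout

def generate_image(layout):
    """Step 3 (hypothetical): a layout-guided generator renders the boxes.
    We return a stand-in dict instead of actual pixels."""
    return {"num_boxes": len(layout), "objects": [name for name, _ in layout]}

def count_module(image, name):
    """Hypothetical VPEval skill module: count instances of an object."""
    return image["objects"].count(name)

def run_eval(image, program):
    """Run a VPEval-style program: each step invokes a skill-specific module
    and records a pass/fail result plus a textual explanation."""
    results = []
    for skill, arg, expected in program:
        if skill == "count":
            got = count_module(image, arg)
            results.append((got == expected,
                            f"counted {got} {arg}(s), expected {expected}"))
    score = sum(ok for ok, _ in results) / len(results)
    return score, results

# Toy end-to-end run.
prompt = "two dogs playing with a ball"
objects = generate_objects_and_counts(prompt)
layout = generate_layout(objects)
image = generate_image(layout)
score, explanations = run_eval(image, [("count", "dog", 2)])
```

The point of the sketch is the separation of concerns: spatial decisions live in the LM-produced layout (steps 1-2), rendering is delegated to a layout-conditioned generator (step 3), and evaluation is a program over expert modules rather than one opaque score.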


