Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark

by Vitali Petsiuk et al.

We provide a new multi-task benchmark for evaluating text-to-image models, and we perform a human evaluation comparing the most common open-source (Stable Diffusion) and commercial (DALL-E 2) models. Twenty computer science AI graduate students rated the two models on three tasks, at three difficulty levels, across ten prompts each, providing 3,600 ratings (20 raters × 2 models × 3 tasks × 3 levels × 10 prompts).

Text-to-image generation has progressed rapidly, to the point that many recent models can create realistic, high-resolution images for a wide range of prompts. However, current text-to-image methods, and vision-language research more broadly, still struggle with intricate text prompts that contain many objects with multiple attributes and relationships. We introduce a new text-to-image benchmark that contains a suite of thirty-two tasks over multiple applications, capturing a model's ability to handle different features of a text prompt. For example, one task asks a model to generate a varying number of the same object to measure its ability to count, while another provides a prompt with several objects that each carry a different attribute, to test whether the model binds objects and attributes correctly. Rather than subjectively evaluating text-to-image results on an arbitrary set of prompts, our multi-task benchmark pairs challenge tasks at three difficulty levels (easy, medium, and hard) with human ratings for each generated image.
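To make the benchmark's structure concrete, below is a minimal Python sketch of how prompts for two of the task types described above (counting and attribute binding) could be generated at three difficulty levels. The object and color vocabularies, the level-to-size mapping, and the function names are illustrative assumptions for this sketch, not the paper's actual prompt set.

import random

# Hypothetical sketch: vocabularies, level definitions, and function names
# below are assumptions for illustration, not the benchmark's actual prompts.

OBJECTS = ["apple", "dog", "cup", "chair", "bird"]
COLORS = ["red", "blue", "green", "yellow", "purple"]

# Difficulty scales with the number of objects to count, or with the number
# of distinct object-attribute pairs that must be bound correctly.
LEVELS = {"easy": 1, "medium": 3, "hard": 5}

def counting_prompt(level, rng):
    """Counting task: ask for an exact number of one object."""
    n = LEVELS[level]
    obj = rng.choice(OBJECTS)
    return f"a photo of {n} {obj}{'' if n == 1 else 's'}"

def attribute_binding_prompt(level, rng):
    """Attribute-binding task: several objects, each with its own color."""
    n = LEVELS[level]
    objs = rng.sample(OBJECTS, n)   # distinct objects
    cols = rng.sample(COLORS, n)    # distinct attributes
    return "a photo of " + " and ".join(f"a {c} {o}" for c, o in zip(cols, objs))

if __name__ == "__main__":
    rng = random.Random(0)
    for level in LEVELS:
        print(f"{level:6s} | {counting_prompt(level, rng)}")
        print(f"{level:6s} | {attribute_binding_prompt(level, rng)}")

In this framing, a rater would score each generated image on whether the requested count, or each object-attribute pairing, is actually satisfied, yielding the per-image human ratings the benchmark collects.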

