On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

by Shruti Palaskar, et al.

Integrating vision and language has gained notable attention following the success of pretrained language models. Despite that, only a fraction of emerging multimodal models is suitable for text generation conditioned on images. This minority is typically developed and evaluated for image captioning, a text generation task conditioned solely on images, with the goal of describing what is explicitly visible in an image. In this paper, we take a step back and ask: How do these models work for more complex generative tasks, conditioned on both text and images? Are models based on joint multimodal pretraining, visually adapted pretrained language models, or models that combine these two approaches more promising for such tasks? We address these questions in the context of self-rationalization (jointly generating task labels/answers and free-text explanations) of three tasks: (i) visual question answering in VQA-X, (ii) visual commonsense reasoning in VCR, and (iii) visual-textual entailment in E-SNLI-VE. We show that recent advances in each modality, namely CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs. We also observe that no model type works universally best across tasks/datasets and finetuning data sizes. Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.


Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

Integrating outside knowledge for reasoning in visio-linguistic tasks su...

What You See is What You Read? Improving Text-Image Alignment Evaluation

Automatically determining whether a text and a corresponding image are s...

Rethinking Evaluation Practices in Visual Question Answering: A Case Study on Out-of-Distribution Generalization

Vision-and-language (V&L) models pretrained on large-scale multimodal ...

I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data

Many high-level skills that are required for computer vision tasks, such...

Data Curation for Image Captioning with Text-to-Image Generative Models

Recent advances in image captioning are mainly driven by large-scale vis...

Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization

Multimodal abstractive summarization (MAS) models that summarize videos ...

Text Conditional Alt-Text Generation for Twitter Images

In this work we present an approach for generating alternative text (or ...
