Universal Captioner: Inducing Content-Style Separation in Vision-and-Language Model Training

11/24/2021
by Marcella Cornia, et al.

While captioning models have obtained compelling results in describing natural images, there is a growing effort to increase their capability of dealing with real-world concepts. In this paper, we address the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-collected and automatically collected captions. To this end, we propose a model that induces a separation between content and descriptive style by incorporating stylistic parameters and keywords extracted from large-scale multi-modal models as pivotal data. In terms of visual features, our model avoids the need for object detectors and employs grid-like features together with a single prompt language modeling objective. Experimentally, we consistently outperform existing methods in terms of caption quality and the capability of describing out-of-domain concepts. Finally, our model obtains a new state of the art on both COCO and nocaps.
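To make the keyword-and-style prompting idea concrete, below is a minimal sketch in Python. It uses CLIP as a stand-in for the large-scale multi-modal model that supplies content keywords; the style token, the prompt format, and the candidate word list are illustrative assumptions, not the authors' exact implementation.

```python
# Sketch of content-style separation via prompting, assuming CLIP as the
# keyword extractor. The style token and prompt layout are hypothetical.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical candidate vocabulary; in practice this would be a large word list.
candidates = ["dog", "frisbee", "beach", "skyscraper", "violin", "pizza"]

def extract_keywords(image: Image.Image, k: int = 3) -> list[str]:
    """Rank candidate words by CLIP image-text similarity and keep the top k."""
    inputs = processor(text=candidates, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(candidates))
    top = logits[0].topk(k).indices.tolist()
    return [candidates[i] for i in top]

def build_prompt(image: Image.Image, style_token: str = "<human>") -> str:
    """Prefix the caption to be generated with a style token and CLIP keywords.

    A captioner trained with a plain language-modeling loss on
    `prompt + caption` then receives content (keywords) and style (token)
    as separate, controllable inputs.
    """
    keywords = extract_keywords(image)
    return f"{style_token} {' '.join(keywords)} :"

image = Image.open("example.jpg")
print(build_prompt(image))  # e.g. "<human> dog frisbee beach :"
```

At inference, swapping the style token (e.g. a cleaner, human-like style versus a noisy web-caption style) would steer the register of the generated description while the keywords fix its content, which is the separation the abstract describes.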
