Understanding, Categorizing and Predicting Semantic Image-Text Relations

by Christian Otto et al.

Two modalities are often used to convey information in a complementary and beneficial manner, e.g., in online news, videos, educational resources, and scientific publications. The automatic understanding of the semantic correlations between text and associated images, as well as their interplay, has great potential for enhanced multimodal web search and recommender systems. However, the automatic understanding of multimodal information remains an unsolved research problem. Recent approaches such as image captioning focus on precisely describing visual content and translating it to text, but they typically address neither semantic interpretations nor the specific role or purpose of an image-text constellation. In this paper, we go beyond previous work and, inspired by research in visual communication, investigate semantic image-text relations that are useful for multimodal information retrieval. We derive a categorization of eight semantic image-text classes (e.g., "illustration" or "anchorage") and show how they can be systematically characterized by a set of three metrics: cross-modal mutual information, semantic correlation, and the status relation of image and text. Furthermore, we present a deep learning system that predicts these classes using multimodal embeddings. To obtain a sufficiently large amount of training data, we automatically collected and augmented data from a variety of datasets and web resources, which enables future research on this topic. Experimental results on a demanding test set demonstrate the feasibility of the approach.
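The prediction setup described in the abstract can be sketched as follows: fuse an image embedding with a text embedding ("early fusion") and let a small classifier map the fused vector to one of the eight semantic image-text classes. This is a minimal, hedged sketch of that idea only; the abstract names just "illustration" and "anchorage", so the remaining class names, all dimensions, and the random toy weights below are illustrative assumptions, not the authors' actual architecture.

```python
import math
import random

# The abstract names only "illustration" and "anchorage"; the other six
# class names here are illustrative placeholders.
CLASSES = [
    "uncorrelated", "complementary", "interdependent", "anchorage",
    "illustration", "contrasting", "bad illustration", "bad anchorage",
]

def classify(image_emb, text_emb, w1, w2):
    """Return a probability distribution over the eight classes."""
    fused = image_emb + text_emb                      # concatenate embeddings (early fusion)
    hidden = [max(sum(x * w for x, w in zip(fused, unit)), 0.0)
              for unit in w1]                         # ReLU hidden layer
    logits = [sum(h * w for h, w in zip(hidden, unit)) for unit in w2]
    peak = max(logits)                                # numerically stable softmax
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)
DIM_IMG, DIM_TXT, HIDDEN = 8, 6, 16                   # toy sizes, not the paper's
# Toy vectors standing in for real encoder outputs (e.g., CNN image
# features and aggregated word embeddings).
image_emb = [random.gauss(0, 1) for _ in range(DIM_IMG)]
text_emb = [random.gauss(0, 1) for _ in range(DIM_TXT)]
w1 = [[random.gauss(0, 0.2) for _ in range(DIM_IMG + DIM_TXT)]
      for _ in range(HIDDEN)]
w2 = [[random.gauss(0, 0.2) for _ in range(HIDDEN)]
      for _ in range(len(CLASSES))]

probs = classify(image_emb, text_emb, w1, w2)
predicted = CLASSES[probs.index(max(probs))]
```

In a real system the two toy vectors would come from pretrained image and text encoders, and the weights would be trained on the automatically collected dataset the abstract describes; the sketch only shows the fusion-then-classify shape of the approach.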




