Adversarial Representation Learning for Text-to-Image Matching

08/28/2019
by Nikolaos Sarafianos, et al.

For many computer vision applications such as image captioning, visual question answering, and person search, learning discriminative feature representations at both the image and text levels is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge by introducing loss functions that help the network learn better feature representations, but fails to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly available language model that extracts word embeddings, can successfully be applied in the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely used, publicly available datasets, resulting in absolute improvements ranging from 2
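The abstract pairs an adversarial objective with a cross-modal matching objective. The sketch below illustrates the general idea only and is not the paper's formulation: the linear modality discriminator, the binary cross-entropy adversarial losses, and the hinge-based ranking loss are all assumptions swapped in for illustration, and every function name is hypothetical. It uses plain Python lists as feature vectors.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def discriminator_score(feature, weights, bias):
    """Probability that `feature` came from the image branch.

    Assumption: a linear discriminator, chosen only to keep the sketch small.
    """
    return sigmoid(sum(w * f for w, f in zip(weights, feature)) + bias)


def adversarial_losses(image_feats, text_feats, weights, bias, eps=1e-12):
    """Return (discriminator_loss, encoder_loss) as binary cross-entropy terms.

    The discriminator is trained to label image features 1 and text features 0.
    The encoder loss rewards text features that the discriminator mistakes for
    image features, which pushes both modalities toward a shared distribution.
    """
    p_img = [discriminator_score(f, weights, bias) for f in image_feats]
    p_txt = [discriminator_score(f, weights, bias) for f in text_feats]
    d_loss = -(
        sum(math.log(p + eps) for p in p_img)
        + sum(math.log(1.0 - p + eps) for p in p_txt)
    ) / (len(p_img) + len(p_txt))
    g_loss = -sum(math.log(p + eps) for p in p_txt) / len(p_txt)
    return d_loss, g_loss


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def matching_loss(image_feats, text_feats, margin=0.2):
    """Hinge ranking loss over a batch in which image i matches text i.

    Assumption: a generic margin-based loss, not the paper's matching objective.
    """
    n = len(image_feats)
    total = 0.0
    for i in range(n):
        pos = cosine(image_feats[i], text_feats[i])
        for j in range(n):
            if j != i:
                neg = cosine(image_feats[i], text_feats[j])
                total += max(0.0, margin - pos + neg)
    return total / n
```

In a training loop one would typically alternate between minimizing the discriminator loss with respect to the discriminator parameters and minimizing the sum of the matching loss and the encoder loss with respect to the two encoders. How the terms are weighted is a design choice this sketch leaves open.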

research
09/20/2020

Dual-path CNN with Max Gated block for Text-Based Person Re-identification

Text-based person re-identification (Re-id) is an important task in video...
research
09/09/2021

Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers

Pretrained vision-and-language BERTs aim to learn representations that c...
research
07/31/2018

Deep Cross Modal Learning for Caricature Verification and Identification (CaVINet)

Learning from different modalities is a challenging task. In this paper,...
research
02/02/2019

Collaborative Quantization for Cross-Modal Similarity Search

Cross-modal similarity search is a problem about designing a search syst...
research
10/11/2019

Cross-modal Scene Graph Matching for Relationship-aware Image-Text Retrieval

Image-text retrieval of natural scenes has been a popular research topic...
research
04/27/2020

A Novel Attention-based Aggregation Function to Combine Vision and Language

The joint understanding of vision and language has been recently gaining...
research
11/15/2017

Dual-Path Convolutional Image-Text Embedding with Instance Loss

Matching images and sentences demands a fine understanding of both modal...
