Saliency-Guided Attention Network for Image-Sentence Matching

04/20/2019
by   Zhong Ji, et al.

This paper studies the task of matching images and sentences, where learning appropriate representations across multi-modal data is the main challenge. Unlike previous approaches that predominantly deploy symmetrical architectures to represent both modalities, we propose the Saliency-guided Attention Network (SAN), which asymmetrically employs visual and textual attention modules to learn the fine-grained correlations between vision and language. The proposed SAN consists of three components: a saliency detector, a Saliency-weighted Visual Attention (SVA) module, and a Saliency-guided Textual Attention (STA) module. Concretely, the saliency detector provides visual saliency information as guidance for the two attention modules. SVA leverages this saliency information to improve the discrimination of visual representations. By fusing the visual information from SVA with textual information as a multi-modal guidance, STA learns discriminative textual representations that are highly sensitive to visual clues. Extensive experiments demonstrate that SAN improves on state-of-the-art results on the benchmark Flickr30K and MSCOCO datasets by a large margin.
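To make the asymmetric design concrete, the following is a minimal sketch of the two attention modules as the abstract describes them: visual attention modulated by region saliency scores, and textual attention guided by the attended visual context. All function names, shapes, and the specific fusion (multiplying relevance scores by saliency before the softmax) are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def saliency_weighted_visual_attention(regions, saliency, query):
    """Sketch of SVA: attend over image regions, with saliency scores
    modulating the attention weights (fusion scheme is an assumption).

    regions:  (k, d) region features
    saliency: (k,)   saliency scores in [0, 1] from a saliency detector
    query:    (d,)   context vector (e.g., global image representation)
    returns:  (d,)   saliency-weighted visual representation
    """
    scores = regions @ query              # relevance of each region to the query
    weights = softmax(scores * saliency)  # saliency sharpens/suppresses regions
    return weights @ regions

def saliency_guided_textual_attention(words, visual_ctx):
    """Sketch of STA: attend over word features using the visual
    context from SVA as the guidance signal.

    words:      (n, d) word features
    visual_ctx: (d,)   output of saliency_weighted_visual_attention
    returns:    (d,)   visually-grounded textual representation
    """
    weights = softmax(words @ visual_ctx)  # words scored against visual clues
    return weights @ words
```

In a matching pipeline, the two outputs would then be compared with a similarity function (e.g., cosine similarity) under a ranking loss; that part is omitted here.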


