Exploring Overall Contextual Information for Image Captioning in Human-Like Cognitive Style

10/15/2019
by Hongwei Ge, et al.

Image captioning is a research hotspot in which encoder-decoder models combining a convolutional neural network (CNN) with a long short-term memory (LSTM) network achieve promising results. Despite significant progress, these models generate sentences differently from how humans do. Existing models typically generate a complete sentence from the first word to the last, without considering how the words yet to be generated influence the sentence as a whole. In this paper, we explore a human-like cognitive style, i.e., building an overall cognition of the image to be described and the sentence to be constructed, to enhance computer image understanding. This paper first proposes a Mutual-aid network structure with Bidirectional LSTMs (MaBi-LSTMs) for acquiring overall contextual information. In the training process, the forward and backward LSTMs encode the succeeding and preceding words into their respective hidden states while simultaneously constructing the whole sentence in a complementary manner. In the captioning process, each LSTM implicitly utilizes the subsequent semantic information contained in its hidden states. In effect, MaBi-LSTMs generate two sentences, one in the forward direction and one in the backward direction. To bridge the gap between the cross-domain models and generate a higher-quality sentence, we further develop a cross-modal attention mechanism that retouches the two sentences by fusing their salient parts as well as the salient regions of the image. Experimental results on the Microsoft COCO dataset show that the proposed model improves the performance of encoder-decoder models and achieves state-of-the-art results.
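As a rough illustration of the bidirectional decoding idea described above, the sketch below pairs a forward and a backward LSTM decoder over a shared image feature, each predicting its own caption hypothesis. This is not the authors' implementation: the class and layer names, dimensions, and the initialization scheme are assumptions for illustration, and the paper's cross-modal retouching step is only indicated by a comment rather than implemented.

```python
import torch
import torch.nn as nn

class MaBiCaptionerSketch(nn.Module):
    """Minimal sketch of a forward/backward LSTM decoder pair over image features.

    Illustrative approximation of the MaBi-LSTM idea, not the authors' code;
    sizes and the fusion step are assumptions.
    """

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.img_proj = nn.Linear(feat_dim, hidden_dim)   # project global CNN feature
        self.fwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.bwd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fwd_out = nn.Linear(hidden_dim, vocab_size)
        self.bwd_out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, captions):
        # img_feats: (B, feat_dim) global CNN feature; captions: (B, T) token ids
        h0 = torch.tanh(self.img_proj(img_feats)).unsqueeze(0)  # (1, B, H)
        c0 = torch.zeros_like(h0)

        fwd_emb = self.embed(captions)                    # left-to-right word order
        bwd_emb = self.embed(torch.flip(captions, [1]))   # right-to-left word order

        fwd_h, _ = self.fwd_lstm(fwd_emb, (h0, c0))
        bwd_h, _ = self.bwd_lstm(bwd_emb, (h0, c0))

        # Each decoder predicts its own sentence; a later retouching step
        # (cross-modal attention in the paper) would fuse the two hypotheses
        # with salient image regions. Not implemented in this sketch.
        return self.fwd_out(fwd_h), self.bwd_out(bwd_h)
```

In this reading, training the two decoders jointly lets each one's hidden states absorb context from the other direction, which is the "overall contextual information" the abstract refers to; at caption time the two hypotheses would then be fused rather than used independently.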

Related research:

vieCap4H-VLSP 2021: Vietnamese Image Captioning for Healthcare Domain using Swin Transformer and Attention-based LSTM (09/03/2022)
Image Captioning with Deep Bidirectional LSTMs (04/04/2016)
Hierarchical LSTMs with Adaptive Attention for Visual Captioning (12/26/2018)
Macroscopic Control of Text Generation for Image Captioning (01/20/2021)
CNN+CNN: Convolutional Decoders for Image Captioning (05/23/2018)
Recurrent Image Captioner: Describing Images with Spatial-Invariant Transformation and Attention Filtering (12/15/2016)
A Neural-Symbolic Approach to Natural Language Tasks (10/29/2017)
