Embedded Heterogeneous Attention Transformer for Cross-lingual Image Captioning

by   Zijie Song, et al.

Cross-lingual image captioning is confronted with both cross-lingual and cross-modal challenges for multimedia analysis. The crucial issue in this task is to model the global and local matching between the image and different languages. Existing cross-modal embedding methods based on Transformer architecture oversight the local matching between the image region and monolingual words, not to mention in the face of a variety of differentiated languages. Due to the heterogeneous property of the cross-modal and cross-lingual task, we utilize the heterogeneous network to establish cross-domain relationships and the local correspondences between the image and different languages. In this paper, we propose an Embedded Heterogeneous Attention Transformer (EHAT) to build reasoning paths bridging cross-domain for cross-lingual image captioning and integrate into transformer. The proposed EHAT consists of a Masked Heterogeneous Cross-attention (MHCA), Heterogeneous Attention Reasoning Network (HARN) and Heterogeneous Co-attention (HCA). HARN as the core network, models and infers cross-domain relationship anchored by vision bounding box representation features to connect two languages word features and learn the heterogeneous maps. MHCA and HCA implement cross-domain integration in the encoder through the special heterogeneous attention and enable single model to generate two language captioning. We test on MSCOCO dataset to generate English and Chinese, which are most widely used and have obvious difference between their language families. Our experiments show that our method even achieve better than advanced monolingual methods.


page 1

page 4

page 8


Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment

Unpaired cross-lingual image captioning has long suffered from irrelevan...

Fluency-Guided Cross-Lingual Image Captioning

Image captioning has so far been explored mostly in English, as most ava...

Unsupervised Cross-lingual Image Captioning

Most recent image captioning works are conducted in English as the major...

Unpaired Cross-lingual Image Caption Generation with Self-Supervised Rewards

Generating image descriptions in different languages is essential to sat...

Cross-lingual Hate Speech Detection using Transformer Models

Hate speech detection within a cross-lingual setting represents a paramo...

GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction

Prevalent approaches in cross-lingual relation and event extraction use ...

Cross-modal Language Generation using Pivot Stabilization for Web-scale Language Coverage

Cross-modal language generation tasks such as image captioning are direc...

Please sign up or login with your details

Forgot password? Click here to reset