Scene Graph as Pivoting: Inference-time Image-free Unsupervised Multimodal Machine Translation with Visual Scene Hallucination

05/20/2023
by   Hao Fei, et al.

In this work, we investigate a more realistic unsupervised multimodal machine translation (UMMT) setup, inference-time image-free UMMT, where the model is trained on source-text and image pairs but tested with only source-text inputs. First, we represent the input images and texts with visual and language scene graphs (SGs), whose fine-grained vision-language features ensure a holistic understanding of the semantics. To enable pure-text input during inference, we devise a visual scene hallucination mechanism that dynamically generates a pseudo visual SG from the given textual SG. Several SG-pivoting-based learning objectives are introduced for unsupervised translation training. On the benchmark Multi30K data, our SG-based method outperforms the best-performing baseline by significant BLEU margins in this task and setup, yielding translations with better completeness, relevance, and fluency without relying on paired images. Further in-depth analyses reveal how our model advances in the task setting.
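The core mechanism described above — hallucinating a pseudo visual SG from a textual SG so that no image is needed at inference time — can be illustrated with a toy sketch. Everything below is an assumption for illustration: the `SceneGraph` container, the `hallucinate_visual_sg` function, and the rule of attaching a placeholder visual-attribute node to each object are hypothetical stand-ins; the actual model performs learned graph-to-graph generation, not a hand-written rule.

```python
from dataclasses import dataclass, field

@dataclass
class SceneGraph:
    # nodes: id -> (label, node_type); node_type in {"object", "attribute", "relation"}
    nodes: dict = field(default_factory=dict)
    # edges: directed (src_id, dst_id) pairs
    edges: list = field(default_factory=list)

def hallucinate_visual_sg(text_sg: SceneGraph) -> SceneGraph:
    """Toy stand-in for visual scene hallucination: copy the textual SG
    skeleton, then densify it by attaching a plausible visual-attribute
    node to every object node (the real model learns this completion)."""
    vis = SceneGraph(dict(text_sg.nodes), list(text_sg.edges))
    next_id = max(vis.nodes, default=-1) + 1
    for nid, (label, ntype) in text_sg.nodes.items():
        if ntype == "object":
            # Hypothetical placeholder for a predicted visual attribute.
            vis.nodes[next_id] = (f"visual-attr({label})", "attribute")
            vis.edges.append((nid, next_id))
            next_id += 1
    return vis

# "a man rides a horse" as a textual SG
t_sg = SceneGraph(
    nodes={0: ("man", "object"), 1: ("rides", "relation"), 2: ("horse", "object")},
    edges=[(0, 1), (1, 2)],
)
v_sg = hallucinate_visual_sg(t_sg)
print(len(v_sg.nodes), len(v_sg.edges))  # 5 4 — two attribute nodes hallucinated
```

The pseudo visual SG then plays the role the parsed image SG played during training, which is what lets the translation pipeline run image-free at inference.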

