FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

05/27/2023
by Zhuang Li, et al.

Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval. However, existing parsers that convert image captions into scene graphs often suffer from two types of errors. First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, and so lack faithfulness. Second, the generated scene graphs are highly inconsistent: the same semantics are represented by different annotations. To address these challenges, we propose a novel dataset built by re-annotating the captions in Visual Genome (VG) with a new intermediate representation called FACTUAL-MR, which can be directly converted into faithful and consistent scene graph annotations. Our experimental results show that a parser trained on our dataset outperforms existing approaches in terms of faithfulness and consistency. This improvement leads to a significant performance boost in both image caption evaluation and zero-shot image retrieval. Furthermore, we introduce a novel metric for measuring scene graph similarity, which, when combined with the improved scene graph parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets for these tasks. The code and dataset are available at https://github.com/zhuang-li/FACTUAL.
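
To make the inconsistency problem concrete, here is a minimal sketch that assumes a scene graph is a set of (subject, predicate, object) triples and compares two parses by exact-match triple F1. The `triple_f1` function, the example caption, and the triples are hypothetical illustrations. They are neither the FACTUAL-MR format nor the similarity metric introduced in the paper.

```python
from typing import FrozenSet, Tuple

# Assumption: a scene graph is modeled as a set of (subject, predicate, object) triples.
Triple = Tuple[str, str, str]


def triple_f1(pred: FrozenSet[Triple], ref: FrozenSet[Triple]) -> float:
    """Exact-match F1 over triples (an illustrative stand-in, not the paper's metric)."""
    if not pred or not ref:
        return 0.0
    overlap = len(pred & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical parses of the caption "a man riding a horse on the beach".
reference = frozenset({("man", "ride", "horse"), ("horse", "on", "beach")})
parse_a = frozenset({("man", "ride", "horse"), ("horse", "on", "beach")})
# Same meaning, but the predicate is annotated as "riding" instead of "ride".
parse_b = frozenset({("man", "riding", "horse"), ("horse", "on", "beach")})

score_a = triple_f1(parse_a, reference)
score_b = triple_f1(parse_b, reference)
print(score_a, score_b)
```

Under exact matching, `parse_b` is penalized even though it expresses the same semantics as the reference. This is the kind of annotation inconsistency the abstract describes.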

research
03/25/2018

Scene Graph Parsing as Dependency Parsing

In this paper, we study the problem of parsing structured knowledge grap...
research
12/29/2020

Image-to-Image Retrieval by Learning Similarity between Scene Graphs

As a scene graph compactly summarizes the high-level content of an image...
research
04/02/2023

Learning Similarity between Scene Graphs and Images with Transformers

Scene graph generation is conventionally evaluated by (mean) Recall@K, w...
research
03/21/2023

CompoDiff: Versatile Composed Image Retrieval With Latent Diffusion

This paper proposes a novel diffusion-based model, CompoDiff, for solvin...
research
05/10/2023

Incorporating Structured Representations into Pretrained Vision Language Models Using Scene Graphs

Vision and Language (VL) models have demonstrated remarkable zero-shot p...
research
09/06/2021

GeneAnnotator: A Semi-automatic Annotation Tool for Visual Scene Graph

In this manuscript, we introduce a semi-automatic scene graph annotation...
research
11/01/2022

Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality

Recent visuolinguistic pre-trained models show promising progress on var...
