GEST: the Graph of Events in Space and Time as a Common Representation between Vision and Language

05/22/2023
by Mihai Masala, et al.

One of the essential human skills is the ability to seamlessly build an inner representation of the world. By exploiting this representation, humans can easily find consensus between visual, auditory and linguistic perspectives. In this work, we set out to understand and emulate this ability through an explicit representation for both vision and language: Graphs of Events in Space and Time (GEST). GEST allows us to measure the similarity between texts and videos in a semantic and fully explainable way, through graph matching. It also allows us to generate text and videos from a common representation whose content is well understood. We show that graph matching similarity metrics based on GEST outperform classical text generation metrics and can also boost the performance of state-of-the-art, heavily trained metrics.
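To make the idea of comparing two event graphs by graph matching concrete, here is a minimal toy sketch. It is not the method from the paper: the `Event` and `EventGraph` classes, the node and edge scoring, and the brute-force search over assignments are all assumptions chosen for illustration, and the search is only practical for graphs with a handful of events.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass(frozen=True)
class Event:
    """Hypothetical event node: an action plus the entities involved."""
    action: str
    entities: frozenset


@dataclass
class EventGraph:
    """Hypothetical event graph: nodes are events, edges carry a relation label."""
    events: list
    edges: dict = field(default_factory=dict)  # (i, j) -> relation, e.g. "next"


def node_score(a, b):
    """Assumed node affinity: half for equal action, half for entity overlap."""
    action = 1.0 if a.action == b.action else 0.0
    union = a.entities | b.entities
    overlap = len(a.entities & b.entities) / len(union) if union else 1.0
    return 0.5 * action + 0.5 * overlap


def match_score(g1, g2):
    """Best normalized score over all one-to-one node assignments (brute force)."""
    small, large = (g1, g2) if len(g1.events) <= len(g2.events) else (g2, g1)
    n, m = len(small.events), len(large.events)
    if m == 0:
        return 1.0  # two empty graphs are treated as identical
    best = 0.0
    for perm in permutations(range(m), n):
        # Node term: affinity of each matched pair of events.
        nodes = sum(node_score(small.events[i], large.events[perm[i]])
                    for i in range(n))
        # Edge term: count edges whose relation is preserved by the assignment.
        edges = sum(1.0 for (i, j), rel in small.edges.items()
                    if large.edges.get((perm[i], perm[j])) == rel)
        best = max(best, nodes + edges)
    # Normalize by the larger graph so unmatched events and edges lower the score.
    return best / (m + max(len(g1.edges), len(g2.edges)))
```

In this sketch the assignment that achieves the best score also says which events were matched and which relations were preserved, which is the sense in which a graph-matching score can be inspected rather than taken as an opaque number. A real system would replace the exact string comparison with a learned or lexical similarity and the exhaustive search with an approximate matching algorithm.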

research
03/11/2023

Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos

Joint video-language learning has received increasing attention in recen...
research
05/23/2023

INSTRUCTSCORE: Towards Explainable Text Generation Evaluation with Automatic Feedback

The field of automatic evaluation of text generation made tremendous pro...
research
01/16/2019

A Functional Representation for Graph Matching

Graph matching is an important and persistent problem in computer vision...
research
06/05/2018

Videos as Space-Time Region Graphs

How do humans recognize the action "opening a book" ? We argue that ther...
research
04/30/2020

NUBIA: NeUral Based Interchangeability Assessor for Text Generation

We present NUBIA, a methodology to build automatic evaluation metrics fo...
research
04/19/2021

What can human minimal videos tell us about dynamic recognition models?

In human vision objects and their parts can be visually recognized from ...
research
09/09/2023

FaNS: a Facet-based Narrative Similarity Metric

Similar Narrative Retrieval is a crucial task since narratives are essen...