Pay Attention to Those Sets! Learning Quantification from Images

by Ionut Sorodoc, et al.

Major advances have recently been made in merging language and vision representations. But most tasks considered so far have confined themselves to the processing of objects and lexicalised relations amongst objects (content words). We know, however, that humans (even pre-school children) can abstract over raw data to perform certain types of higher-level reasoning, expressed in natural language by function words. A case in point is their ability to learn quantifiers, i.e. expressions like 'few', 'some' and 'all'. From formal semantics and cognitive linguistics, we know that quantifiers are relations over sets which, as a simplification, we can see as proportions. For instance, in 'most fish are red', 'most' encodes the proportion of fish which are red fish. In this paper, we study how well current language and vision strategies model such relations. We show that state-of-the-art attention mechanisms coupled with a traditional linguistic formalisation of quantifiers give the best performance on the task. Additionally, we provide insights on the role of 'gist' representations in quantification. A 'logical' strategy to tackle the task would be to first obtain a numerosity estimation for the two involved sets and then compare their cardinalities. We argue, however, that precisely identifying the composition of the sets is not only beyond current state-of-the-art models but perhaps even detrimental to a task that is most efficiently performed by refining the approximate numerosity estimator of the system.
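
To make the proportional reading of quantifiers concrete, the following is a minimal sketch of the 'logical' strategy the abstract describes: count the two sets, then compare their cardinalities. It is not the model studied in the paper. The function names `proportion` and `quantifier`, and in particular the cutoff values separating 'few', 'some' and 'most', are assumptions made purely for illustration.

```python
from fractions import Fraction

# Hypothetical cutoffs, chosen only for illustration; the source does not
# specify which proportions correspond to which quantifier.
FEW_MAX = Fraction(1, 5)    # assumed upper bound for 'few'
MOST_MIN = Fraction(1, 2)   # assumed (exclusive) lower bound for 'most'


def proportion(restrictor, scope):
    """Share of the restrictor set (e.g. fish) that is also in the scope set (e.g. red things)."""
    if not restrictor:
        raise ValueError("restrictor set is empty")
    return Fraction(len(restrictor & scope), len(restrictor))


def quantifier(restrictor, scope):
    """Map the exact proportion onto a quantifier label using the assumed cutoffs."""
    p = proportion(restrictor, scope)
    if p == 0:
        return "no"
    if p == 1:
        return "all"
    if p > MOST_MIN:
        return "most"
    if p <= FEW_MAX:
        return "few"
    return "some"


# 'most fish are red': three of the four fish are red.
fish = {"fish1", "fish2", "fish3", "fish4"}
red = {"fish1", "fish2", "fish3", "crab1"}
print(proportion(fish, red))  # 3/4
print(quantifier(fish, red))  # most
```

This sketch relies on exact set membership, which is precisely what the abstract argues is beyond current models and possibly detrimental to the task; an approximate numerosity estimate would replace the exact counts in `proportion`.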




