Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities

by Hexiang Hu, et al.

Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model must link an image to a Wikipedia entity with respect to a text query. We construct OVEN-Wiki by re-purposing 14 existing datasets, grounding all labels onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study of state-of-the-art pre-trained models reveals large headroom in generalizing to this massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that were never seen during fine-tuning. We also find that existing pre-trained models exhibit different strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities.
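The OVEN task interface described above can be sketched as a nearest-neighbor lookup over a shared entity label space. The sketch below is a toy illustration only: the entity count, embedding dimension, and `recognize` helper are hypothetical stand-ins, and the random vectors stand in for embeddings from a frozen dual encoder (a CLIP-style retrieval assumption, one of the two model families the paper compares).

```python
import numpy as np

# Toy sketch of the OVEN task interface (all names/shapes are assumptions):
# given an embedded (image, text query) pair, select one entity out of a
# large label space by cosine similarity against precomputed entity embeddings.
rng = np.random.default_rng(0)

NUM_ENTITIES = 1000  # stands in for the ~6 million Wikipedia entities
DIM = 64             # embedding dimension (assumed)

# Random unit vectors stand in for a frozen dual encoder's entity embeddings.
entity_embs = rng.normal(size=(NUM_ENTITIES, DIM))
entity_embs /= np.linalg.norm(entity_embs, axis=1, keepdims=True)

def recognize(image_query_emb: np.ndarray) -> int:
    """Return the index of the most similar entity by cosine similarity."""
    q = image_query_emb / np.linalg.norm(image_query_emb)
    return int(np.argmax(entity_embs @ q))

# Sanity check: querying with an entity's own embedding recovers that entity.
pred = recognize(entity_embs[42])
```

A PaLI-style auto-regressive recognizer would instead decode the entity name token by token, which avoids materializing an index over millions of labels; the paper's finding is that the two approaches trade off overall accuracy against tail-entity recall.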


