Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

by Shuyu Yang, et al.

In this paper, we introduce a large Multi-Attribute and Language Search dataset for text-based person retrieval, called MALS, and explore the feasibility of pre-training on attribute recognition and image-text matching jointly, in a single framework. In particular, MALS contains 1,510,330 image-text pairs, roughly 37.5 times larger than the prevailing CUHK-PEDES, and every image is annotated with 27 attributes. Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) The attribute prompt learning stream leverages attribute prompts for image-attribute alignment, which enhances text matching learning. (2) The text matching learning stream facilitates representation learning of fine-grained details and, in turn, boosts attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS: APTM achieves state-of-the-art retrieval performance on three challenging real-world benchmarks, surpassing prior methods on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets by clear margins (e.g., a +6.96 improvement on CUHK-PEDES).
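The two-stream idea described above can be sketched as a joint training objective: one contrastive term aligns an image with its caption (text matching), and a second term aligns the same image with prompts built from its annotated attributes (attribute prompt learning). The sketch below is a minimal, illustrative toy in plain Python; the function names (`aptm_joint_loss`), the weight `alpha`, and the use of raw embedding lists instead of transformer encoders are all assumptions for exposition, not the paper's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def contrastive_loss(anchor, positives, negatives, temperature=0.07):
    """InfoNCE-style loss: pull the anchor toward positives, away from negatives."""
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def aptm_joint_loss(img_emb, text_emb, attr_prompt_embs, neg_text_embs, alpha=0.5):
    """Toy joint objective combining the two streams.

    img_emb          -- embedding of the person image
    text_emb         -- embedding of its matching caption
    attr_prompt_embs -- embeddings of prompts built from its attribute labels
    neg_text_embs    -- embeddings of non-matching texts (negatives)
    alpha            -- illustrative weight balancing the two streams
    """
    # Text matching stream: align the image with its caption.
    itm = contrastive_loss(img_emb, [text_emb], neg_text_embs)
    # Attribute prompt stream: align the image with its attribute prompts.
    apl = contrastive_loss(img_emb, attr_prompt_embs, neg_text_embs)
    return itm + alpha * apl
```

In this toy form, the two terms share the image embedding, which mirrors the abstract's claim that each stream reinforces the other: better image-attribute alignment sharpens the representation used for text matching, and vice versa.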



