PLIP: Language-Image Pre-training for Person Representation Learning

by   Jialong Zuo, et al.

Pre-training has emerged as an effective technique for learning powerful person representations. Most existing methods have shown that pre-training on pure-vision large-scale datasets like ImageNet and LUPerson has achieved remarkable performance. However, solely relying on visual information, the absence of robust explicit indicators poses a challenge for these methods to learn discriminative person representations. Drawing inspiration from the intrinsic fine-grained attribute indicators of person descriptions, we explore introducing the language modality into person representation learning. To this end, we propose a novel language-image pre-training framework for person representation learning, termed PLIP. To explicitly build fine-grained cross-modal associations, we specifically design three pretext tasks, semantic-fused image colorization, visual-fused attributes prediction, and vision-language matching. In addition, due to the lack of an appropriate dataset, we present a large-scale person dataset named SYNTH-PEDES, where the Stylish Pedestrian Attributes-union Captioning method is proposed to synthesize diverse textual descriptions. We pre-train PLIP on SYNTH-PEDES and evaluate our model by spanning downstream tasks such as text-based Re-ID, image-based Re-ID, and person attribute recognition. Extensive experiments demonstrate that our model not only significantly improves existing methods on all these tasks, but also shows great ability in the few-shot and domain generalization settings. The code, dataset and weights will be released at <>


Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Text-based Person Search (TPS), is targeted on retrieving pedestrians to...

FashionViL: Fashion-Focused Vision-and-Language Representation Learning

Large-scale Vision-and-Language (V+L) pre-training for representation le...

CLIP Meets Video Captioners: Attribute-Aware Representation Learning Promotes Accurate Captioning

For video captioning, "pre-training and fine-tuning" has become a de fac...

Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

In this paper, we introduce a large Multi-Attribute and Language Search ...

Multi-task Collaborative Pre-training and Individual-adaptive-tokens Fine-tuning: A Unified Framework for Brain Representation Learning

Structural magnetic resonance imaging (sMRI) provides accurate estimates...

Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts

Contrastive pretrained large Vision-Language Models (VLMs) like CLIP hav...

SM+: Refined Scale Match for Tiny Person Detection

Detecting tiny objects ( e.g., less than 20 x 20 pixels) in large-scale ...

Please sign up or login with your details

Forgot password? Click here to reset