Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

by Zhiyin Shao, et al.

The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task. However, there are two underlying inconsistencies between these two tasks that may impact performance: i) Data inconsistency. A large domain gap exists between the generic images/texts used in public pre-trained models and the specific person data in the T2I-ReID task. This gap is especially severe for texts, as general textual data are usually unable to describe specific people in fine-grained detail. ii) Training inconsistency. The pre-training processes for images and texts are independent, despite cross-modality learning being critical to T2I-ReID. To address these issues, we present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task. We first build a large-scale text-labeled person dataset, "LUPerson-T", in which pseudo-textual descriptions of images are automatically generated via the CLIP paradigm using a divide-conquer-combine strategy. Benefiting from this dataset, we then employ a simple vision-and-language pre-training framework to explicitly align the feature spaces of the image and text modalities during pre-training. In this way, the pre-training task and the T2I-ReID task are made consistent with each other at both the data and training levels. Without any bells and whistles, our UniPT achieves competitive Rank-1 accuracies of, e.g., 68.50% and 60.09% on public T2I-ReID benchmarks. The LUPerson-T dataset and code are available at https://github.com/ZhiyinShao-H/UniPT.
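The explicit image-text feature alignment the abstract describes is, in the CLIP paradigm, typically realized as a symmetric contrastive (InfoNCE) loss over matched image-text pairs. The following is a minimal sketch of that idea using random features; the function name and shapes are our own illustration, not the authors' implementation.

```python
import numpy as np

def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss aligning two modalities, CLIP-style.

    Matched image/text pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatched pairs
    apart. This is a sketch of the general technique, not UniPT's code.
    """
    # L2-normalize each modality so the dot product is cosine similarity
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix
    n = logits.shape[0]

    def cross_entropy(l):
        # numerically stable log-softmax; targets are the diagonal
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # average of image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
loss_mismatched = contrastive_alignment_loss(img, rng.normal(size=(4, 8)))
loss_matched = contrastive_alignment_loss(img, img)  # perfectly aligned pairs
```

As expected, perfectly aligned pairs yield a much lower loss than random pairings, which is the signal that drives the two feature spaces toward a shared embedding during pre-training.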


Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search

Text-based Person Search (TPS) is targeted on retrieving pedestrians to...

Unleashing the Potential of Unsupervised Pre-Training with Intra-Identity Regularization for Person Re-Identification

Existing person re-identification (ReID) methods typically directly load...

Large-Scale Pre-training for Person Re-identification with Noisy Labels

This paper aims to address the problem of pre-training for person re-ide...

LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Retrieval

Image-text retrieval (ITR) is a task to retrieve the relevant images/tex...

BiLMa: Bidirectional Local-Matching for Text-based Person Re-identification

Text-based person re-identification (TBPReID) aims to retrieve person im...

UKnow: A Unified Knowledge Protocol for Common-Sense Reasoning and Vision-Language Pre-training

This work presents a unified knowledge protocol, called UKnow, which fac...

FlipReID: Closing the Gap between Training and Inference in Person Re-Identification

Since neural networks are data-hungry, incorporating data augmentation i...
