Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining

by   Xunlin Zhan, et al.

Nowadays, customer's demands for E-commerce are more diversified, which introduces more complications to the product retrieval industry. Previous methods are either subject to single-modal input or perform supervised image-level product retrieval, thus fail to accommodate real-life scenarios where enormous weakly annotated multi-modal data are present. In this paper, we investigate a more realistic setting that aims to perform weakly-supervised multi-modal instance-level product retrieval among fine-grained product categories. To promote the study of this challenging task, we contribute Product1M, one of the largest multi-modal cosmetic datasets for real-world instance-level retrieval. Notably, Product1M contains over 1 million image-caption pairs and consists of two sample types, i.e., single-product and multi-product samples, which encompass a wide variety of cosmetics brands. In addition to the great diversity, Product1M enjoys several appealing characteristics including fine-grained categories, complex combinations, and fuzzy correspondence that well mimic the real-world scenes. Moreover, we propose a novel model named Cross-modal contrAstive Product Transformer for instance-level prodUct REtrieval (CAPTURE), that excels in capturing the potential synergy between multi-modal inputs via a hybrid-stream transformer in a self-supervised manner.CAPTURE generates discriminative instance features via masked multi-modal learning as well as cross-modal contrastive pretraining and it outperforms several SOTA cross-modal baselines. Extensive ablation studies well demonstrate the effectiveness and the generalization capacity of our model.


page 3

page 4

page 8


Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

Our goal in this research is to study a more realistic environment in wh...

Fine-Grained Instance-Level Sketch-Based Video Retrieval

Existing sketch-analysis work studies sketches depicting static objects ...

Contrastive Video Question Answering via Video Graph Transformer

We propose to perform video question answering (VideoQA) in a Contrastiv...

Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-commerce

This paper aims to establish a generic multi-modal foundation model that...

General-Purpose Multi-Modal OOD Detection Framework

Out-of-distribution (OOD) detection identifies test samples that differ ...

Mixer: Image to Multi-Modal Retrieval Learning for Industrial Application

Cross-modal retrieval, where the query is an image and the doc is an ite...

Cross-view Semantic Alignment for Livestreaming Product Recognition

Live commerce is the act of selling products online through live streami...

Please sign up or login with your details

Forgot password? Click here to reset