Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval

by Xiao Dong, et al.

Our goal in this research is to study a more realistic setting in which weakly-supervised multi-modal instance-level product retrieval can be conducted for fine-grained product categories. We first contribute the Product1M dataset and define two practical instance-level retrieval tasks that enable evaluation for price comparison and personalized recommendation. For both instance-level tasks, it is quite challenging to accurately pinpoint the product target mentioned in the visual-linguistic data and to effectively reduce the influence of irrelevant content. To address this, we train a more effective cross-modal pretraining model that adaptively incorporates key concept information from the multi-modal data, using an entity graph whose nodes denote entities and whose edges denote the similarity relations between entities. Specifically, we propose a novel Entity-Graph Enhanced Cross-Modal Pretraining (EGE-CMP) model for instance-level commodity retrieval. It explicitly injects entity knowledge into the multi-modal networks in both node-based and subgraph-based ways via a self-supervised hybrid-stream transformer, which reduces the confusion between different object contents and thereby guides the network to focus on entities with real semantic meaning. Experimental results verify the efficacy and generalizability of EGE-CMP, which outperforms several SOTA cross-modal baselines such as CLIP, UNITER and CAPTURE.
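To make the entity-graph idea concrete, the following is a minimal sketch of a graph whose nodes are entities and whose edges carry a similarity score. It is not the EGE-CMP implementation: the abstract does not say how entity similarity is computed, so cosine similarity over embedding vectors, the `threshold` parameter, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_entity_graph(embeddings, threshold=0.5):
    """Build an undirected entity graph.

    embeddings: dict mapping entity name -> embedding vector (hypothetical input).
    Returns a dict mapping each entity to {neighbor: similarity} for every
    pair whose similarity is at least `threshold` (an assumed cutoff).
    """
    names = sorted(embeddings)
    graph = {name: {} for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = cosine(embeddings[a], embeddings[b])
            if sim >= threshold:
                # Store the edge in both directions: similarity is symmetric.
                graph[a][b] = sim
                graph[b][a] = sim
    return graph


# Toy embeddings, invented for this example only.
toy = {
    "lipstick": [1.0, 0.0, 0.0],
    "lip gloss": [0.9, 0.1, 0.0],
    "shampoo": [0.0, 1.0, 0.0],
}
graph = build_entity_graph(toy, threshold=0.8)
for entity, neighbors in graph.items():
    print(entity, sorted(neighbors))
```

With these toy vectors, only "lipstick" and "lip gloss" are connected, while "shampoo" stays isolated. A node-based view would read a single entity and its edges, and a subgraph-based view would read a connected group of entities; how the paper feeds either into the transformer is not described in the abstract.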



Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-modal Pretraining

Nowadays, customer's demands for E-commerce are more diversified, which ...

Aspect-based Sentiment Classification with Sequential Cross-modal Semantic Graph

Multi-modal aspect-based sentiment classification (MABSC) is an emerging...

Knowledge-Enhanced Hierarchical Information Correlation Learning for Multi-Modal Rumor Detection

The explosive growth of rumors with text and images on social media plat...

3D Shape Knowledge Graph for Cross-domain and Cross-modal 3D Shape Retrieval

With the development of 3D modeling and fabrication, 3D shape retrieval ...

Modeling Temporal-Modal Entity Graph for Procedural Multimodal Machine Comprehension

Procedural Multimodal Documents (PMDs) organize textual instructions and...

MEAformer: Multi-modal Entity Alignment Transformer for Meta Modality Hybrid

As an important variant of entity alignment (EA), multi-modal entity ali...

A Proposal-based Approach for Activity Image-to-Video Retrieval

Activity image-to-video retrieval task aims to retrieve videos containin...
