Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions

by Thong Nguyen et al.
National University of Singapore
Nanyang Technological University

Modern Review Helpfulness Prediction systems rely on multiple modalities, typically text and images. Unfortunately, contemporary approaches pay scarce attention to refining representations of cross-modal relations and tend to suffer from inferior optimization, which can harm the model's predictions in numerous cases. To overcome these issues, we propose Multimodal Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP) problem, concentrating on mutual information between input modalities to explicitly model cross-modal relations. In addition, we introduce an Adaptive Weighting scheme for our contrastive learning approach to increase flexibility in optimization. Lastly, we propose a Multimodal Interaction module to address the unaligned nature of multimodal data, thereby helping the model produce more reasonable multimodal representations. Experimental results show that our method outperforms prior baselines and achieves state-of-the-art results on two publicly available benchmark datasets for the MRHP problem.
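The abstract does not give implementation details, but the core idea of contrastive learning with adaptive weighting can be illustrated with a minimal sketch. The code below is an assumption-laden illustration, not the authors' method: it computes an InfoNCE-style loss over text and image embeddings (matched pairs on the diagonal) and then re-weights per-sample losses so that harder pairs contribute more, a simple stand-in for an adaptive weighting scheme.

```python
import numpy as np

def info_nce(text_emb, image_emb, temperature=0.1):
    """Per-sample InfoNCE loss; row i of each matrix is a matched text-image pair."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature          # pairwise cosine similarities, scaled
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    probs = exp / exp.sum(axis=1, keepdims=True)
    return -np.log(np.diag(probs))          # matched pairs sit on the diagonal

def adaptive_contrastive_loss(text_emb, image_emb, temperature=0.1):
    """Hypothetical adaptive weighting: up-weight samples with higher loss."""
    losses = info_nce(text_emb, image_emb, temperature)
    weights = losses / losses.sum()         # harder pairs get larger weights
    return float((weights * losses).sum())
```

As a sanity check, perfectly aligned text/image embeddings should yield a much lower loss than mismatched ones, since the diagonal similarities dominate each softmax row.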



