A Review of Vision-Language Models and their Performance on the Hateful Memes Challenge

by Bryan Zhao, et al.

Moderating social media content is still a largely manual task, yet far more content is posted each day than human reviewers can handle effectively. Recent multimodal models have the potential to reduce this manual labor. In this work, we compare several models to determine which is most effective on the Hateful Memes Challenge, a challenge by Meta designed to further machine learning research in content moderation. Specifically, we explore the differences between early fusion and late fusion models in classifying multimodal memes that contain both text and images. We first implement unimodal baselines, using BERT for text and ResNet-152 for images, and then concatenate the outputs of these unimodal models to create a late fusion model. For early fusion, we implement ConcatBERT, VisualBERT, ViLT, CLIP, and BridgeTower. We find that late fusion performs significantly worse than the early fusion models; the best-performing model is CLIP, which achieves an AUROC of 70.06. The code for this work is available at https://github.com/bzhao18/CS-7643-Project.
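To make the late fusion setup and the reported metric concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes each unimodal model has already produced a fixed-length output vector; the function names, the single logistic layer, and the toy weights are all hypothetical and chosen only for illustration. The AUROC helper returns a value in [0, 1], whereas the abstract's 70.06 is presumably reported on a 0 to 100 scale.

```python
import math
from typing import Sequence


def late_fusion_score(
    text_out: Sequence[float],
    image_out: Sequence[float],
    weights: Sequence[float],
    bias: float,
) -> float:
    """Concatenate the two unimodal outputs and apply one logistic layer.

    Hypothetical stand-in for a late fusion head: the modalities only meet
    here, after each unimodal model has finished its own forward pass.
    """
    fused = list(text_out) + list(image_out)
    if len(fused) != len(weights):
        raise ValueError("weights must match the concatenated feature length")
    logit = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive example outscores
    a random negative example, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy unimodal outputs (made-up numbers, not real model activations).
    text_outputs = [[0.9, 0.1], [0.2, 0.4], [0.7, 0.3], [0.1, 0.2]]
    image_outputs = [[0.8], [0.1], [0.2], [0.3]]
    labels = [1, 0, 1, 0]
    weights, bias = [1.0, -0.5, 1.5], -1.0

    scores = [
        late_fusion_score(t, i, weights, bias)
        for t, i in zip(text_outputs, image_outputs)
    ]
    print("AUROC:", auroc(labels, scores))
```

By contrast, the early fusion models listed above are described as combining the text and image information inside the model rather than only at the output, which this sketch does not attempt to reproduce.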




