TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models

by Md. Kamrul Hasan, et al.

Pre-trained large language models have recently achieved ground-breaking performance in a wide variety of language understanding tasks. However, the same models cannot be applied to multimodal behavior understanding tasks (e.g., video sentiment/humor detection) unless non-verbal features (e.g., acoustic and visual) can be integrated with language. Jointly modeling multiple modalities significantly increases model complexity and makes the training process data-hungry. While an enormous amount of text data is available via the web, collecting large-scale multimodal behavioral video datasets is extremely expensive in terms of both time and money. In this paper, we investigate whether large language models alone can successfully incorporate non-verbal information when it is presented in textual form. We present a way to convert acoustic and visual information into corresponding textual descriptions and concatenate them with the spoken text. We feed this augmented input to a pre-trained BERT model and fine-tune it on three downstream multimodal tasks: sentiment, humor, and sarcasm detection. Our approach, TextMI, significantly reduces model complexity, adds interpretability to the model's decisions, and can be applied to a diverse set of tasks while achieving superior (multimodal sarcasm detection) or near-SOTA (multimodal sentiment analysis and multimodal humor detection) performance. We propose TextMI as a general, competitive baseline for multimodal behavioral analysis tasks, particularly in low-resource settings.
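The idea of textualizing non-verbal cues and concatenating them with the spoken text can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names (`pitch`, `smile`), the three-level wording, the thresholds, and the helper functions `describe_level`, `textualize`, and `build_input` are all hypothetical, chosen only to show the general shape of such a pipeline. The resulting string is what would then be passed to a BERT tokenizer for fine-tuning.

```python
from typing import Dict, Tuple

# Hypothetical sketch: the abstract does not specify the actual features,
# thresholds, or wording, so everything below is an illustrative assumption.

def describe_level(value: float, low: float, high: float) -> str:
    """Map a numeric feature value to a coarse verbal level."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"

def textualize(features: Dict[str, float],
               thresholds: Dict[str, Tuple[float, float]]) -> str:
    """Turn a dict of feature values into a short textual description."""
    parts = []
    for name, value in features.items():
        low, high = thresholds[name]
        parts.append(f"{describe_level(value, low, high)} {name}")
    return ", ".join(parts)

def build_input(spoken_text: str,
                acoustic: Dict[str, float],
                visual: Dict[str, float],
                thresholds: Dict[str, Tuple[float, float]],
                sep: str = "[SEP]") -> str:
    """Concatenate spoken text with textualized acoustic and visual cues."""
    acoustic_text = textualize(acoustic, thresholds)
    visual_text = textualize(visual, thresholds)
    return f"{spoken_text} {sep} {acoustic_text} {sep} {visual_text}"

if __name__ == "__main__":
    thresholds = {"pitch": (0.3, 0.7), "smile": (0.3, 0.7)}
    print(build_input("I love this", {"pitch": 0.9}, {"smile": 0.1}, thresholds))
```

Because the non-verbal cues end up as ordinary words in the input, the descriptions can be read directly alongside the spoken text, which is one way to understand the interpretability benefit the abstract mentions.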




Few-shot Multimodal Sentiment Analysis based on Multimodal Probabilistic Fusion Prompts

Multimodal sentiment analysis is a trending topic with the explosion of ...

Interpretable multimodal sentiment analysis based on textual modality descriptions by using large-scale language models

Multimodal sentiment analysis is an important area for understanding the...

UniDoc: A Universal Large Multimodal Model for Simultaneous Text Detection, Recognition, Spotting and Understanding

In the era of Large Language Models (LLMs), tremendous strides have been...

Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis

Most existing methods focus on sentiment analysis of textual data. Howev...

On the Use of Modality-Specific Large-Scale Pre-Trained Encoders for Multimodal Sentiment Analysis

This paper investigates the effectiveness and implementation of modality...

Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis

Multimodal learning pipelines have benefited from the success of pretrai...

Characterizing Hirability via Personality and Behavior

While personality traits have been extensively modeled as behavioral con...
