Large-scale Bilingual Language-Image Contrastive Learning

03/28/2022
by ByungSoo Ko, et al.

This paper is a technical report sharing our experience and findings from building a Korean-English bilingual multimodal model. Most multimodal datasets focus on English, and multilingual multimodal research typically relies on machine-translated texts; such translations, however, are limited in their ability to describe unique expressions, cultural context, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes achieves competitive performance in both languages. Moreover, we discuss several multimodal research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) a multimodal model trained without explicit cross-lingual alignment can still learn the relation through shared visual semantics; 3) our bilingual KELIP captures cultural differences in the visual semantics of words with the same meaning; 4) a large-scale multimodal model can be used for multimodal feature analogies. We hope this work provides useful experience and findings for future research, and we release a pre-trained KELIP as open source.
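The title's core technique, language-image contrastive learning, follows the CLIP recipe: matched image-text pairs are pulled together and mismatched pairs pushed apart in a shared embedding space. Below is a minimal PyTorch sketch of that symmetric InfoNCE objective; the function name, fixed temperature, and batch layout are illustrative assumptions, not details reported in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(image_features, text_features, temperature=0.07):
    """Symmetric InfoNCE loss for CLIP-style language-image training.

    image_features, text_features: (N, D) batches from the two encoders,
    where row i of each tensor comes from the same image-text pair.
    Sketch only; KELIP's exact loss settings are not given in the abstract.
    """
    # L2-normalize so dot products become cosine similarities.
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # (N, N) similarity matrix, scaled by the temperature.
    logits = image_features @ text_features.t() / temperature

    # Matched pairs sit on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```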
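Of the training schemes the abstract names, multi-crop augmentation has a well-known reference recipe (introduced by SwAV): a few large global crops plus several small local crops of each image. The sketch below shows that recipe with torchvision; crop sizes, counts, and scale ranges are illustrative defaults, not KELIP's reported settings.

```python
from torchvision import transforms

def multi_crop(global_size=224, local_size=96, n_global=2, n_local=4):
    """SwAV-style multi-crop: returns a list of transforms producing
    n_global large crops and n_local small crops per image.
    Illustrative defaults; not the settings used in the paper."""
    global_t = transforms.Compose([
        transforms.RandomResizedCrop(global_size, scale=(0.4, 1.0)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    local_t = transforms.Compose([
        transforms.RandomResizedCrop(local_size, scale=(0.05, 0.4)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ])
    return [global_t] * n_global + [local_t] * n_local

# Usage: crops = [t(img) for t in multi_crop()]  # img is a PIL image
```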

Related research

10/24/2022
Multilingual Multimodal Learning with Machine Translated Text
Most vision-and-language pretraining research focuses on English tasks. ...

11/02/2022
M-SpeechCLIP: Leveraging Large-Scale, Pre-Trained Models for Multilingual Speech to Image Retrieval
This work investigates the use of large-scale, pre-trained models (CLIP ...

03/02/2021
WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
The milestone improvements brought about by deep representation learning...

04/18/2021
LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding
Multimodal pre-training with text, layout, and image has achieved SOTA p...

05/13/2023
RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training
Multilingual vision-language (V&L) pre-training has achieved remarkabl...

08/30/2019
PAWS-X: A Cross-lingual Adversarial Dataset for Paraphrase Identification
Most existing work on adversarial data generation focuses on English. Fo...
