WuDaoMM: A large-scale Multi-Modal Dataset for Pre-training models

by Sha Yuan, et al.

Compared with domain-specific models, vision-language pre-training models (VLPMs) have shown superior performance on downstream tasks with a fast fine-tuning process. For example, ERNIE-ViL, Oscar and UNIMO trained VLPMs with a uniform transformer stack architecture and large amounts of image-text paired data, achieving remarkable results on downstream tasks such as image-text retrieval (IR and TR), visual question answering (VQA) and image captioning (IC). During the training phase, VLPMs are usually fed a combination of multiple public datasets to meet the demand for large-scale training data. However, because these datasets differ in size, task type and quality, training a model on their mixture can be problematic. In this work, we introduce a large-scale multi-modal corpus named WuDaoMM, containing more than 650 million image-text pairs in total. Specifically, about 600 million pairs are collected from multiple webpages in which the image and caption are weakly correlated, and the other 50 million strongly correlated image-text pairs are collected from high-quality graphic websites. We also release a base version of WuDaoMM with 5 million strongly correlated image-text pairs, which is sufficient to support common cross-modal model pre-training. In addition, we trained both an understanding and a generation vision-language (VL) model to test the effectiveness of the dataset. The results show that WuDaoMM can serve as an efficient dataset for VLPMs, especially for text-to-image generation. The data is released at https://data.wudaoai.cn


Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset and A Foundation Framework

This paper presents a large-scale Chinese cross-modal dataset for benchm...

LAION-5B: An open large-scale dataset for training next generation image-text models

Groundbreaking language-vision architectures like CLIP and DALL-E proved...

Quilt-1M: One Million Image-Text Pairs for Histopathology

Recent accelerations in multi-modal applications have been made possible...

3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment

3D vision-language grounding (3D-VL) is an emerging field that aims to c...

LAION-400M: Open Dataset of CLIP-Filtered 400 Million Image-Text Pairs

Multi-modal language-vision models trained on hundreds of millions of im...

Bamboo: Building Mega-Scale Vision Dataset Continually with Human-Machine Synergy

Large-scale datasets play a vital role in computer vision. Existing data...

Scaling Language-Image Pre-training via Masking

We present Fast Language-Image Pre-training (FLIP), a simple and more ef...
