The Vault: A Comprehensive Multilingual Dataset for Advancing Code Understanding and Generation

05/09/2023
by   Dung Nguyen Manh, et al.
0

We present The Vault, an open-source, large-scale code-text dataset designed to enhance the training of code-focused large language models (LLMs). Existing open-source datasets for training code-based LLMs often face challenges in terms of size, quality (due to noisy signals), and format (only containing code function and text explanation pairings). The Vault overcomes these limitations by providing 40 million code-text pairs across 10 popular programming languages, thorough cleaning for 10+ prevalent issues, and various levels of code-text pairings, including class, function, and line levels. Researchers and practitioners can utilize The Vault for training diverse code-focused LLMs or incorporate the provided data cleaning methods and scripts to improve their datasets. By employing The Vault as the training dataset for code-centric LLMs, we anticipate significant advancements in code understanding and generation tasks, fostering progress in both artificial intelligence research and software development practices.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
08/08/2023

A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages

Large Language Models (LLMs) are advanced Artificial Intelligence (AI) s...
research
05/25/2021

Project CodeNet: A Large-Scale AI for Code Dataset for Learning a Diversity of Coding Tasks

Advancements in deep learning and machine learning algorithms have enabl...
research
05/09/2018

The Dataset Nutrition Label: A Framework To Drive Higher Data Quality Standards

Artificial intelligence (AI) systems built on incomplete or biased data ...
research
03/17/2022

CodeReviewer: Pre-Training for Automating Code Review Activities

Code review is an essential part to software development lifecycle since...
research
04/11/2023

Evaluating AIGC Detectors on Code Content

Artificial Intelligence Generated Content (AIGC) has garnered considerab...
research
08/25/2023

COCO: Testing Code Generation Systems via Concretized Instructions

Code generation systems have been extensively developed in recent years ...
research
09/17/2023

CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

The driving factors behind the development of large language models (LLM...

Please sign up or login with your details

Forgot password? Click here to reset