MultiPL-E: A Scalable and Extensible Approach to Benchmarking Neural Code Generation

by Federico Cassano, et al.

Large language models have demonstrated the ability to generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks, HumanEval and MBPP, to 18 additional programming languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its performance on Python for several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
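To make the idea of compiling a unit test to a new language concrete, here is a minimal sketch that turns one Python equality assertion into a JavaScript one. It is an illustration only, not the MultiPL-E implementation: the function names `to_js_literal` and `translate_assert`, the choice of JavaScript as the target, and the use of Node's `assert.deepEqual` are all assumptions made for this example, and only a small subset of literals is handled.

```python
import ast
import json


def to_js_literal(node: ast.expr) -> str:
    """Translate a small subset of Python literals to JavaScript source text."""
    if isinstance(node, ast.Constant):
        value = node.value
        # bool must be checked before int, because bool is a subclass of int.
        if isinstance(value, bool):
            return "true" if value else "false"
        if value is None:
            return "null"
        if isinstance(value, (int, float)):
            return repr(value)
        if isinstance(value, str):
            return json.dumps(value)  # double-quoted, escaped string
    if isinstance(node, (ast.List, ast.Tuple)):
        # Hypothetical design choice: both lists and tuples become JS arrays.
        return "[" + ", ".join(to_js_literal(e) for e in node.elts) + "]"
    raise ValueError(f"unsupported literal: {ast.dump(node)}")


def translate_assert(py_assert: str, func_name: str) -> str:
    """Translate `assert f(args...) == expected` into a JavaScript assertion."""
    stmt = ast.parse(py_assert).body[0]
    if not isinstance(stmt, ast.Assert) or not isinstance(stmt.test, ast.Compare):
        raise ValueError("expected an assert with a comparison")
    compare = stmt.test
    if len(compare.ops) != 1 or not isinstance(compare.ops[0], ast.Eq):
        raise ValueError("only a single == comparison is supported")
    call = compare.left
    if not isinstance(call, ast.Call) or call.keywords:
        raise ValueError("left side must be a call with positional arguments")
    args = ", ".join(to_js_literal(a) for a in call.args)
    expected = to_js_literal(compare.comparators[0])
    return f"assert.deepEqual({func_name}({args}), {expected});"


if __name__ == "__main__":
    print(translate_assert("assert candidate([1, 2, 3], 'a') == [True, None]", "f"))
```

Because the translation works on literals and a single comparison rather than on arbitrary code, supporting another target language in this sketch would mean supplying another literal translator and another assertion template. Anything outside the handled subset (negative numbers, dictionaries, keyword arguments) raises `ValueError` rather than being guessed at.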

