Measuring Coding Challenge Competence With APPS

by Dan Hendrycks, et al.

While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take an arbitrary natural language specification and generate satisfactory Python code. Similar to how companies assess candidate software developers, we then evaluate models by checking their generated code on test cases. Our benchmark includes 10,000 problems, which range from having simple one-line solutions to being substantial algorithmic challenges. We fine-tune large language models on both GitHub and our training set, and we find that the prevalence of syntax errors is decreasing exponentially as models improve. Recent models such as GPT-Neo can pass approximately 20% of the test cases of introductory problems, so we find that machine learning models are now beginning to learn how to code. As the social significance of automatic code generation increases over the coming years, our benchmark can provide an important measure for tracking advancements.


Language Models Can Teach Themselves to Program Better

This work shows how one can use large-scale language models (LMs) to syn...

Evaluating How Fine-tuning on Bimodal Data Effects Code Generation

Despite the increase in popularity of language models for code generatio...

A Study on Robustness and Reliability of Large Language Model Code Generation

Recently, the large language models (LLMs) have shown extraordinary abil...

Towards Enhancing In-Context Learning for Code Generation

In-context learning (ICL) with pre-trained language models (PTLMs) has s...

ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

In this work, we make the first attempt to evaluate LLMs in a more chall...

Unmasking the giant: A comprehensive evaluation of ChatGPT's proficiency in coding algorithms and data structures

The transformative influence of Large Language Models (LLMs) is profound...

How Readable is Model-generated Code? Examining Readability and Visual Inspection of GitHub Copilot

Background: Recent advancements in large language models have motivated ...
