CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

by Hamel Husain, et al.

Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language, which is better suited to describing vague concepts and ideas. To enable evaluation of progress on code search, we are releasing the CodeSearchNet Corpus and presenting the CodeSearchNet Challenge, which consists of 99 natural language queries with about 4k expert relevance annotations of likely results from the CodeSearchNet Corpus. The corpus contains about 6 million functions from open-source code spanning six programming languages (Go, Java, JavaScript, PHP, Python, and Ruby). The CodeSearchNet Corpus also contains automatically generated query-like natural language for 2 million functions, obtained by mechanically scraping and preprocessing the associated function documentation. In this article, we describe the methodology used to obtain the corpus and expert labels, as well as a number of simple baseline solutions for the task. We hope that the CodeSearchNet Challenge encourages researchers and practitioners to study this interesting task further, and we will host a competition and leaderboard to track progress on the challenge. We are also keen on extending the challenge to more queries and programming languages in the future.
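To make the task definition concrete, the following is a minimal sketch of ranking functions against a natural language query by bag-of-words cosine similarity. It is an illustration only and is not one of the baselines described in the article; the toy corpus, the function names (`tokenize`, `cosine`, `search`), and the identifier-splitting rule are all assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return lowercased word tokens, splitting snake_case and camelCase."""
    # Insert a space at lower-to-upper boundaries: "sortList" -> "sort List".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    # Underscores, digits, and punctuation act as separators.
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def search(query, corpus, top_k=3):
    """Rank corpus entries by similarity to the query; drop zero-score hits."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(code))), name) for name, code in corpus.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical three-function corpus, invented for illustration.
TOY_CORPUS = {
    "read_json": "def read_json_file(path):\n    with open(path) as f:\n        return json.load(f)",
    "sort_list": "def sortListDescending(items):\n    return sorted(items, reverse=True)",
    "http_get": "def http_get(url):\n    return urllib.request.urlopen(url).read()",
}

if __name__ == "__main__":
    print(search("read a json file", TOY_CORPUS))
```

Pure token overlap of this kind only works when the query happens to reuse words that appear in the code, which is exactly the vocabulary gap between code and natural language that the abstract describes.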




