Recommendations for Datasets for Source Code Summarization

04/04/2019
by Alexander LeClair, et al.
Source code summarization is the task of writing short, natural language descriptions of source code. The main use for these descriptions is in software documentation, e.g., the one-sentence Java method descriptions in JavaDocs. Code summarization is rapidly becoming a popular research problem, but progress is restrained by a lack of suitable datasets. In addition, a lack of community standards for creating datasets leads to confusing and unreproducible research results: we observe swings in performance of more than 33% due only to changes in dataset design. In this paper, we make recommendations for these standards based on experimental results. We release a dataset, based on prior work, of over 2.1m pairs of Java methods and one-sentence method descriptions from over 28k Java projects. We describe the dataset and point out key differences from natural language data, to guide and support future researchers.
