To What Extent do Deep Learning-based Code Recommenders Generate Predictions by Cloning Code from the Training Set?

04/14/2022
by   Matteo Ciniselli, et al.

Deep Learning (DL) models have been widely used to support code completion. These models, once properly trained, can take as input an incomplete code component (e.g., an incomplete function) and predict the missing tokens to finalize it. GitHub Copilot is an example of a code recommender built by training a DL model on millions of open source repositories: the source code of these repositories acts as training data, allowing the model to learn "how to program". The usage of such code is usually regulated by Free and Open Source Software (FOSS) licenses, which establish under which conditions the licensed code can be redistributed or modified. As of today, it is unclear whether the code generated by DL models trained on open source code should be considered "new" or "derivative" work, with possible implications for license infringement. In this work, we run a large-scale study investigating the extent to which DL models tend to clone code from their training set when recommending code completions. Such an exploratory study can help in assessing the magnitude of the potential licensing issues mentioned above: if these models tend to generate new code that is unseen in the training set, then licensing issues are unlikely to occur. Otherwise, a revision of these licenses is urgently needed to regulate how the code generated by these models should be treated when used, for example, in a commercial setting. Highlights from our results show that ~10% to ~0.1% of the predictions generated by a state-of-the-art DL-based code completion tool are Type-1 clones of instances in the training set, depending on the size of the predicted code. Long predictions are unlikely to be cloned.
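To make the notion of a Type-1 clone concrete: two code fragments are conventionally called Type-1 clones when they are identical except for whitespace, layout, and comments. The following is a minimal sketch of such a check, not the procedure used in the study. The tokenizer, the function names `normalize`, `build_index`, and `is_type1_clone`, and the choice to compare whole fragments rather than sub-sequences are all assumptions made for illustration, with Java-style comment syntax assumed.

```python
import re
from typing import Iterable, Set, Tuple

# Hypothetical tokenizer for Java-like code. Comments are matched and dropped;
# string literals are kept whole so "//" inside a string is not seen as a comment.
_TOKEN = re.compile(
    r'''
    (?P<comment>//[^\n]*|/\*.*?\*/)      # line or block comment
  | (?P<string>"(?:\\.|[^"\\])*")        # double-quoted string literal
  | (?P<word>[A-Za-z_$][\w$]*|\d[\w.]*)  # identifier, keyword, or number
  | (?P<symbol>\S)                       # any other non-space character
    ''',
    re.VERBOSE | re.DOTALL,
)


def normalize(code: str) -> Tuple[str, ...]:
    """Return the token sequence of `code`, ignoring whitespace and comments."""
    return tuple(
        m.group() for m in _TOKEN.finditer(code) if m.lastgroup != "comment"
    )


def build_index(training_instances: Iterable[str]) -> Set[Tuple[str, ...]]:
    """Normalize every training instance once so lookups are set membership tests."""
    return {normalize(instance) for instance in training_instances}


def is_type1_clone(prediction: str, index: Set[Tuple[str, ...]]) -> bool:
    """True if the prediction equals some training instance up to layout and comments."""
    return normalize(prediction) in index


training = ["int sum = a + b; // add the operands", "return x;"]
index = build_index(training)
print(is_type1_clone("int   sum=a+b;", index))      # differs only in layout and comments
print(is_type1_clone("int total = a + b;", index))  # renamed identifier, not Type-1
```

Comparing token sequences rather than whitespace-stripped strings avoids accidentally merging adjacent tokens such as `int x` and `intx`. A renamed identifier, as in the second call, would at most make the fragment a Type-2 clone and is deliberately not matched here.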


