A Language-Agnostic Model for Semantic Source Code Labeling

06/03/2019
by Ben Gelman, et al.

Code search and comprehension have become more difficult in recent years due to the rapid expansion of available source code. Current tools lack a way to label arbitrary code at scale while maintaining up-to-date representations of new programming languages, libraries, and functionalities. Comprehensive labeling of source code enables users to search for documents of interest and obtain a high-level understanding of their contents. We use Stack Overflow code snippets and their tags to train a language-agnostic, deep convolutional neural network to automatically predict semantic labels for source code documents. On Stack Overflow code snippets, we demonstrate a mean area under the ROC curve of 0.957 over a long-tailed list of 4,508 tags. We also manually validate the model outputs on a diverse set of unlabeled source code documents retrieved from GitHub, and we obtain a top-1 accuracy of 86.6%, indicating that the model successfully transfers its knowledge from Stack Overflow snippets to arbitrary source code documents.
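The two metrics quoted in the abstract can be illustrated with a small sketch. The abstract does not say how the per-tag AUC values are averaged or how the top-1 prediction is selected, so the unweighted (macro) mean over tags and the highest-scoring tag used below are assumptions. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC for one tag: the fraction of (positive, negative) pairs that the
    scores rank correctly, counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auc(y_true: Dict[str, List[int]],
             y_score: Dict[str, List[float]]) -> float:
    """Unweighted mean of per-tag AUCs (assumed averaging scheme)."""
    aucs = [roc_auc(y_true[tag], y_score[tag]) for tag in y_true]
    return sum(aucs) / len(aucs)


def top1_tag(doc_scores: Dict[str, float]) -> str:
    """Tag with the highest predicted score for a single document."""
    return max(doc_scores, key=doc_scores.get)


# Toy data: four documents, two tags. Values are illustrative only.
y_true = {"python": [1, 0, 1, 0], "sql": [0, 1, 0, 0]}
y_score = {"python": [0.9, 0.2, 0.4, 0.6], "sql": [0.1, 0.8, 0.3, 0.2]}

print(mean_auc(y_true, y_score))              # 0.875
print(top1_tag({"python": 0.9, "sql": 0.1}))  # python
```

The pairwise AUC is quadratic in the number of documents per tag, which is fine for illustration; a rank-based implementation would be preferable at the scale of thousands of tags.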
