CONCORD: Clone-aware Contrastive Learning for Source Code

06/05/2023
by Yangruibo Ding, et al.

Deep Learning (DL) models for analyzing source code have shown immense promise over the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations that are valuable for many downstream SE tasks, such as clone detection and bug detection. While previous work successfully learned from different code abstractions (e.g., tokens, ASTs, graphs), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On the one hand, human developers tend to write repetitive programs, referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow) rather than implementing functions from scratch; such behavior results in a vast number of code clones. On the other hand, a clone that deviates by mistake can introduce buggy or even malicious program behavior. Thus, as a proxy for incorporating developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised contrastive learning strategy that places benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can help existing pre-trained models learn better representations, making them more effective at both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
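The abstract does not spell out the training objective, but the clone-aware idea maps naturally onto an InfoNCE-style contrastive loss. Below is a minimal PyTorch sketch, assuming each anchor function is paired with one benign clone (the positive) and one buggy deviant (a hard negative); the function name clone_aware_contrastive_loss and this exact formulation are illustrative assumptions, not CONCORD's published objective.

import torch
import torch.nn.functional as F

def clone_aware_contrastive_loss(anchor, clone, deviant, temperature=0.07):
    """InfoNCE-style sketch: pull each anchor toward its benign clone,
    push its deviant (plus in-batch negatives) away.

    anchor, clone, deviant: (batch, dim) embeddings from some code encoder.
    """
    # Normalize embeddings so dot products are cosine similarities.
    anchor = F.normalize(anchor, dim=-1)
    clone = F.normalize(clone, dim=-1)
    deviant = F.normalize(deviant, dim=-1)
    # Anchor-to-clone similarities: diagonal entries are the positives,
    # off-diagonal entries serve as in-batch negatives.
    sim_clones = anchor @ clone.T / temperature              # (batch, batch)
    # Each anchor's own deviant acts as an extra hard negative.
    sim_deviant = (anchor * deviant).sum(dim=-1, keepdim=True) / temperature  # (batch, 1)
    logits = torch.cat([sim_clones, sim_deviant], dim=1)     # (batch, batch + 1)
    # Row i's positive sits at column i of the clone block.
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

# Usage with random stand-in embeddings:
batch, dim = 8, 256
loss = clone_aware_contrastive_loss(
    torch.randn(batch, dim), torch.randn(batch, dim), torch.randn(batch, dim))

Treating the deviant as one extra column in the softmax, alongside the in-batch negatives, is a common way to encode "clones together, deviants apart" without introducing a separate margin term.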


