Enhancing Semantic Code Search with Multimodal Contrastive Learning and Soft Data Augmentation

by Ensheng Shi, et al.

Code search aims to retrieve the most semantically relevant code snippet for a given natural language query. Recently, large-scale pre-trained code models such as CodeBERT and GraphCodeBERT have learned generic representations of source code and achieved substantial improvements on the code search task. However, high-quality sequence-level representations of code snippets have not been sufficiently explored. In this paper, we propose a new approach to code search based on multimodal contrastive learning and soft data augmentation. Multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart unpaired ones. Data augmentation is critical in contrastive learning for producing high-quality representations, yet existing work considers only semantics-preserving augmentations of source code. We instead propose soft data augmentation: dynamically masking and replacing some tokens in a code sequence to generate code snippets that are similar, but not necessarily semantics-preserving, as positive samples for the paired queries. We conduct extensive experiments on a large-scale dataset covering six programming languages. The experimental results show that our approach significantly outperforms state-of-the-art methods. We also adapt our techniques to several pre-trained models, such as RoBERTa and CodeBERT, and significantly boost their performance on code search.
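The two ideas in the abstract can be illustrated with a small, self-contained sketch. The function and parameter names below (`soft_augment`, `info_nce`, `p_mask`, `p_replace`, `tau`) are illustrative choices, not the paper's actual implementation: soft augmentation randomly masks or replaces tokens in a code sequence, and an in-batch InfoNCE-style loss treats each paired code-query similarity (the diagonal of a similarity matrix) as the positive and all other in-batch pairs as negatives.

```python
import math
import random

MASK = "[MASK]"

def soft_augment(tokens, vocab, p_mask=0.1, p_replace=0.1, rng=None):
    """Soft data augmentation (a sketch of the abstract's idea): each token
    is independently masked or replaced with a random vocabulary token,
    yielding a similar but not necessarily semantics-preserving variant."""
    rng = rng or random.Random(0)
    out = []
    for tok in tokens:
        r = rng.random()
        if r < p_mask:
            out.append(MASK)               # dynamic masking
        elif r < p_mask + p_replace:
            out.append(rng.choice(vocab))  # random token replacement
        else:
            out.append(tok)                # keep the original token
    return out

def info_nce(sim_matrix, tau=0.07):
    """In-batch contrastive (InfoNCE-style) loss over an n x n matrix of
    code-query similarities: diagonal entries are the paired (positive)
    examples, off-diagonal entries are unpaired (negative) examples."""
    n = len(sim_matrix)
    loss = 0.0
    for i in range(n):
        logits = [s / tau for s in sim_matrix[i]]
        m = max(logits)  # numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += -(logits[i] - log_z)
    return loss / n

# Toy usage: augment a tokenized snippet and compare two similarity matrices.
code = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
augmented = soft_augment(code, vocab=["x", "y", "z"],
                         p_mask=0.3, p_replace=0.2, rng=random.Random(1))
# Well-aligned pairs (high diagonal) give a lower loss than unaligned ones.
aligned = [[1.0, 0.0], [0.0, 1.0]]
uniform = [[0.5, 0.5], [0.5, 0.5]]
```

In a real model, `sim_matrix` would come from cosine similarities between encoder embeddings of code snippets and queries; here plain numbers stand in for them to keep the sketch dependency-free.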


