CupCleaner: A Data Cleaning Approach for Comment Updating

by Qingyuan Liang, et al.

Recently, deep learning-based techniques have shown promising performance on various software engineering tasks. For these learning-based approaches to perform well, obtaining high-quality data is a fundamental and crucial issue. Comment updating is an emerging software engineering task that aims to automatically update comments based on changes in the corresponding source code. However, datasets for the comment updating task are usually crawled from committed versions in open source software repositories such as GitHub, where there is a lack of quality control over comments. In this paper, we focus on cleaning existing comment updating datasets by taking into account properties of the comment updating process in software development. We propose a semantic and overlapping-aware approach named CupCleaner (Comment UPdating's CLEANER) for this purpose. Specifically, we calculate a score based on semantic and overlapping information of the code and comments. Based on the distribution of the scores, we filter out the low-scoring data in the tail of the distribution to remove potentially unclean data. We first conducted a human evaluation of the noise data and high-quality data identified by CupCleaner. The results show that the human ratings of the noise data identified by CupCleaner are significantly lower. Then, we applied our data cleaning approach to the training and validation sets of three existing comment updating datasets while keeping the test set unchanged. Our experimental results show that even after filtering out over 30% of the data using CupCleaner, there is still an improvement in all performance metrics. The experimental results on the cleaned test set also suggest that CupCleaner may help in constructing datasets for updating-related tasks.
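The score-then-filter idea described in the abstract can be illustrated with a minimal sketch. This is not CupCleaner's actual scoring function, which the abstract does not specify: the semantic component is omitted entirely, and the overlap measure, the tokenizer, and the names `tokens`, `overlap_score`, `filter_low_tail`, and `drop_fraction` are all hypothetical choices made only for illustration.

```python
import re
from typing import List, Sequence, Set, TypeVar

T = TypeVar("T")

# Splits identifiers such as maxRetries or HTTPServer into word pieces.
_TOKEN_RE = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def tokens(text: str) -> Set[str]:
    """Return the set of lowercase word pieces found in code or comment text."""
    return {piece.lower() for piece in _TOKEN_RE.findall(text)}


def overlap_score(old_code: str, new_code: str,
                  old_comment: str, new_comment: str) -> float:
    """Hypothetical overlap score (an assumption, not the paper's formula).

    Measures the fraction of tokens newly added to the comment that were
    also newly added to the code. A comment edit unrelated to the code
    change scores 0.0.
    """
    code_added = tokens(new_code) - tokens(old_code)
    comment_added = tokens(new_comment) - tokens(old_comment)
    if not comment_added:
        return 0.0
    return len(code_added & comment_added) / len(comment_added)


def filter_low_tail(samples: Sequence[T], scores: Sequence[float],
                    drop_fraction: float) -> List[T]:
    """Drop the lowest-scoring fraction of samples, keeping original order."""
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    n_drop = int(len(samples) * drop_fraction)
    # Indices sorted from lowest to highest score; the first n_drop are removed.
    order = sorted(range(len(samples)), key=lambda i: scores[i])
    dropped = set(order[:n_drop])
    return [s for i, s in enumerate(samples) if i not in dropped]


# Two toy samples: (old_code, new_code, old_comment, new_comment).
related = ("int maxRetries = 3;", "int maxAttempts = 3;",
           "the max retries", "the max attempts")
unrelated = ("return a + b;", "return a - b;",
             "adds numbers", "fixed typo here")

samples = [related, unrelated]
scores = [overlap_score(*s) for s in samples]
kept = filter_low_tail(samples, scores, drop_fraction=0.5)
```

In this toy run, the sample whose comment edit mirrors the renamed identifier is kept, while the sample whose comment edit shares no new tokens with the code change falls into the low tail and is dropped. A real cleaner would choose the cut-off from the observed score distribution rather than from a fixed fraction picked by hand.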




Learning to Update Natural Language Comments Based on Code Changes

We formulate the novel task of automatically updating an existing natura...

The "Shut the f**k up" Phenomenon: Characterizing Incivility in Open Source Code Review Discussions

Code review is an important quality assurance activity for software deve...

HatCUP: Hybrid Analysis and Attention based Just-In-Time Comment Updating

When changing code, developers sometimes neglect updating the related co...

A probabilistic interpretation of replicator-mutator dynamics

In this note, we investigate the relationship between probabilistic upda...

Automated Identification of Toxic Code Reviews: How Far Can We Go?

Toxic conversations during software development interactions may have se...

A ground-truth dataset and classification model for detecting bots in GitHub issue and PR comments

Bots are frequently used in Github repositories to automate repetitive a...

DeepPERF: A Deep Learning-Based Approach For Improving Software Performance

Improving software performance is an important yet challenging part of t...
