Automated Patch Assessment for Program Repair at Scale
In this paper, we perform automatic correctness assessment of patches generated by program repair techniques. We consider the human-written patch as the ground-truth oracle and randomly generate tests based on it, an approach we call Random testing with Ground Truth (RGT). We build a curated dataset of 638 patches for Defects4J generated by 14 state-of-the-art repair systems, and we evaluate automated patch assessment on this dataset, which is, to our knowledge, the largest of its kind. The results of this study are novel and significant. First, we show that 10 patches classified as correct by the authors of previous research are in fact overfitting. Second, we demonstrate that the human patch is not a perfect ground truth. Third, we precisely measure the trade-off between the time spent on test generation and the benefits for automated patch assessment at scale.
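To make the RGT idea concrete, here is a minimal sketch of the classification step, assuming the test suite has already been generated from the human-patched version and that a hypothetical `run_tests` helper (not from the paper) executes a suite against a program version and returns the names of failing tests:

```python
def classify_patch(generated_suite, human_patched, machine_patched, run_tests):
    """Classify a machine patch as overfitting or possibly correct (RGT sketch)."""
    # Tests were generated from the human patch, so they should pass on it;
    # any that fail there are flaky and must be discarded, not counted.
    flaky = set(run_tests(generated_suite, human_patched))

    # Run the same suite on the machine-patched program.
    failing = set(run_tests(generated_suite, machine_patched)) - flaky

    # A remaining failure means the machine patch diverges from the
    # ground-truth behavior encoded by the human patch.
    return "overfitting" if failing else "possibly-correct"
```

Note that RGT can only prove a patch overfitting, not correct: a patch that passes all generated tests may still diverge on inputs the random generator never produced.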