DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

by Yizheng Chen, et al.

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites and extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 26,635 vulnerable functions spanning 150 CWEs and 352,606 non-vulnerable functions, extracted from 7,861 commits. It covers 305 more projects than all previous datasets combined. We show that increasing the diversity and volume of training data improves the performance of deep learning models for vulnerability detection. Combining our new dataset with previous datasets, we analyze the challenges and promising research directions of using deep learning to detect software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection because of high false positive rates, low F1 scores, and difficulty detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. However, we also identify hopeful future research directions. We demonstrate that large language models (LLMs) are the future for vulnerability detection, outperforming Graph Neural Networks (GNNs) with manual feature engineering. Moreover, developing source-code-specific pre-training objectives is a promising research direction for improving vulnerability detection performance.
