Document Type

Article

Publication Date

2021

Abstract

As machine learning-assisted vulnerability detection research matures, it is critical to understand the datasets being used by existing papers. In this paper, we explore seven C/C++ datasets and evaluate their suitability for machine learning-assisted vulnerability detection. We also present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files, a sufficiently large sample to be reasonably considered representative of typical C/C++ code. To facilitate comparison, we tokenize all of the datasets and perform the analysis at the token level. We make three primary contributions. First, while all the datasets differ from our Wild C dataset, some do so to a greater degree. This includes divergence in file lengths and token usage frequency. Additionally, none of the datasets contains the entirety of the C/C++ vocabulary. These missing tokens account for up to 11% of all token usage. Second, we find that all the datasets contain duplication, with some containing a significant amount. In the Juliet dataset, we describe augmentations of test cases that make the dataset susceptible to data leakage. This augmentation occurs with such frequency that a random 80/20 split has roughly 58% overlap of the test data with the training data. Finally, we collect and process a large dataset of C/C++ code, named Wild C. This dataset is designed to serve as a representative sample of all C/C++ code and is the basis for our analyses.
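The two measurements mentioned in the abstract, test/train overlap under a random split and the share of token usage missing from a dataset's vocabulary, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the regex tokenizer is a crude stand-in for a real C/C++ lexer, duplication is defined as an exact match of token sequences, and the function names `tokenize`, `split_overlap`, and `missing_token_share` are hypothetical.

```python
import random
import re

# Assumption: a crude stand-in for a C/C++ lexer (identifiers, integers, single symbols).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(source):
    """Return the token sequence of a source string as a hashable tuple."""
    return tuple(TOKEN_RE.findall(source))


def split_overlap(files, test_fraction=0.2, seed=0):
    """Fraction of test files whose token sequence also occurs in the training split.

    Assumption: two files count as duplicates when their token sequences match exactly.
    """
    rng = random.Random(seed)
    shuffled = list(files)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    train_keys = {tokenize(f) for f in train}
    hits = sum(1 for f in test if tokenize(f) in train_keys)
    return hits / len(test)


def missing_token_share(dataset_files, reference_files):
    """Share of token usage in the reference corpus from tokens absent in the dataset."""
    vocab = {t for f in dataset_files for t in tokenize(f)}
    ref_tokens = [t for f in reference_files for t in tokenize(f)]
    missing = sum(1 for t in ref_tokens if t not in vocab)
    return missing / len(ref_tokens)
```

Because comparison happens on token sequences, files that differ only in whitespace count as duplicates in this sketch. A real analysis would likely need a proper lexer and a decision about whether identifiers and literals are normalized before comparison.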

Comments

This work is licensed under a Creative Commons Attribution 4.0 International License.

Repository Citation

Grahn, D., & Zhang, J. (2021). An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection. Proceedings of the Conference on Applied Machine Learning for Information Security, 2021.
https://corescholar.libraries.wright.edu/cse/605

Included in

Computer Sciences Commons, Engineering Commons

Computer Science and Engineering Faculty Publications

An Analysis of C/C++ Datasets for Machine Learning-Assisted Software Vulnerability Detection
