In this paper, we consider how the ability to learn Code Execution Tasks affects a model's accuracy on software vulnerability detection (SVD) benchmark datasets. We initially find that models can achieve near state-of-the-art accuracy on SVD benchmarks regardless of their ability to learn Code Execution Tasks. However, these models fail to generalize well across SVD benchmarks. The results indicate a bias in the datasets that allows models to predict from non-SVD signals. Under the theory that different collection methods will reduce biases, we investigate combining the SVD datasets. When models are trained on the combined datasets, SVD accuracy is reduced, but its correlation with Code Execution Task accuracy improves. Our contributions are (1) using reversed curriculum learning to evaluate model capabilities, (2) demonstrating the criticality of code execution understanding to machine-learning-assisted software vulnerability detection, (3) providing evidence that improved diversity of SVD datasets will lead to improved accuracy and generalizability, and (4) benchmarking recent models across multiple SVD datasets.


This article was presented at the