Publication Date
2023
Document Type
Dissertation
Committee Members
Junjie Zhang, Ph.D. (Committee Chair); Lingwei Chen, Ph.D. (Committee Co-Chair); Phu Phung, Ph.D. (Committee Member); Tanvi Banerjee, Ph.D. (Committee Member); Krishnaprasad Thirunarayan, Ph.D. (Committee Member)
Degree Name
Doctor of Philosophy (PhD)
Abstract
As our world has become dependent upon software for nearly every aspect of modern society, software security has followed as an essential feature. The first line of defense against vulnerabilities is secure coding. While today's programmers are carefully taught secure coding best practices, they can make mistakes or intentionally introduce vulnerable code. The traditional backstop to human errors and insider threats is the adoption of automated security analysis tools. These analysis tools have limitations. Static analysis suffers from high false positive rates that can cause annoyance and complacency among developers. Dynamic analysis can be difficult to set up and very computationally expensive. As a result of these shortcomings, researchers have turned to Machine Learning as a way to improve the performance of automated security analysis tools.

Recent Machine Learning–Assisted Software Vulnerability Detection (MLAVD) research has focused on large-scale models with hundreds of millions of parameters powered by expensive attention- or graph-based architectures. Despite increased model capacity, current models have limited accuracy and struggle to generalize to unseen data. This dissertation presents systematic research to understand and enhance the efficiency and efficacy of MLAVD models.

First, we explore 7 C/C++ datasets and evaluate their suitability for the task. As part of this effort, we present a new dataset, named Wild C, containing over 10.3 million individual open-source C/C++ files. We find that the existing datasets are not representative of typical C/C++ code, diverging in file length, token frequency, vocabulary, and other properties. Additionally, all of the datasets contain duplication, with the Juliet dataset containing pre-split code augmentations, making it susceptible to data leakage.

Second, we consider how 5 different deep learning architectures perform on 5 tests designed to simulate tasks that are prerequisites for software vulnerability detection. We demonstrate how commonly used MLAVD architectures struggle to learn from these tasks, indicating a gap between true semantic vulnerability detection and the apparent syntactic methods of existing approaches.

Finally, we perform the first study of resource-efficient MLAVD, showing that such models can be competitive with strong MLAVD baselines. We design Vul-Mixer, a resource-efficient architecture inspired by how the human brain processes code. Through extensive experimentation, we demonstrate that Vul-Mixer enhances efficiency and efficacy by improving state-of-the-art generalization while using only 0.2% of the baseline's parameters.
Page Count
148
Department or Program
Department of Computer Science and Engineering
Year Degree Awarded
2023
Copyright
Copyright 2023, some rights reserved. My ETD may be copied and distributed only for non-commercial purposes and may not be modified. All use must give me credit as the original author.
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
ORCID ID
0000-0002-2619-1680