Electronic Thesis/Dissertation


Learn2Reason: Joint Statistical and Formal Learning Approach to Improve the Robustness and Time-to-Solution for Software Security

Open Access

With the rapid rise in software size and complexity, analyzing and fixing bugs in large-scale applications is becoming increasingly critical, and securing such applications has become very challenging. Traditionally, there are two lines of code analysis techniques, each with fundamental limitations: pure statistical methods and pure formal methods. Used alone, the former lack accuracy and the latter require exhaustive analysis along all paths in the application code. In this dissertation, we first design a joint learning framework that uses both techniques to protect against unsafe memory accesses in programs. Memory violations remain among the leading causes of software vulnerabilities. Memory safety checkers, such as SoftBound, enforce spatial memory safety by checking whether accesses to array elements fall within the corresponding array bounds. However, such checks often incur high execution-time overhead because of the extra instructions executed for each bound check. To mitigate this problem, techniques to eliminate redundant bound checks are needed. We propose two novel frameworks, SIMBER and Clone-Hunter, to eliminate redundant memory bound checks in source code and binaries, respectively. In contrast to existing techniques that rely primarily on static code analysis, our solution leverages learning-based techniques to identify redundant bound checks.

Additionally, understanding software and detecting duplicate code fragments is an important task, especially in large code bases. Detecting similar code fragments, usually referred to as code clones, can help in discovering vulnerabilities, refactoring code, and removing unnecessary code segments. In particular, binary code clone detection can be of significant use for legacy applications that are already deployed in several critical domains. We present learning-based frameworks for domain-specific code clone detection.
Our approach first eliminates non-domain-related instructions through program slicing, then applies a deep learning-based algorithm to model the remaining binary instructions of each code sample as numerical vectors. We then use clustering algorithms to aggregate code clones and formal analysis to verify their validity.

To further illustrate the benefit of our joint learning approach, we combine machine learning-based binary code analysis frameworks with dynamic execution and trace analysis to create customized, self-contained programs that minimize the potential attack surface. The approach automatically identifies program features (i.e., independent, well-contained operations, utilities, or capabilities) relating to application binaries and their communication functions, then tailors and eliminates features to create customized program binaries in accordance with user needs, in a fully unsupervised fashion.

This dissertation aims to harness the advantages of both statistical analysis and formal techniques to perform rigorous code analysis on both source code and binary executables while maintaining scalability and speed. Its main contribution is to address security issues through code analysis that integrates statistical analysis and formal methods, thereby reducing the time-to-solution.
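
As a rough illustration of the clone-aggregation step mentioned in the abstract (code samples modeled as numerical vectors, then clustered), the following minimal Python sketch groups fragments whose vectors are nearly parallel. It is not the dissertation's implementation: the hand-written vectors stand in for learned embeddings, the function names and the 0.95 similarity threshold are hypothetical, the union-find grouping is a simple stand-in for whatever clustering algorithm is actually used, and the slicing and formal verification steps are omitted.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def cluster_clones(vectors, threshold=0.95):
    """Group fragment names whose vectors are pairwise-linked by similarity.

    `vectors` maps a fragment name to its (assumed, precomputed) embedding.
    Two fragments are linked when their cosine similarity is >= threshold;
    linked fragments are merged with a small union-find structure.
    Returns a sorted list of candidate clone groups with at least two members.
    """
    names = list(vectors)
    parent = {name: name for name in names}

    def find(name):
        # Follow parent links to the root, compressing the path on the way.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(g) for g in groups.values() if len(g) > 1)


# Hypothetical embeddings for five code fragments.
embeddings = {
    "f1": [1.0, 0.0, 0.0],
    "f2": [0.99, 0.01, 0.0],
    "f3": [0.0, 1.0, 0.0],
    "f4": [0.0, 0.98, 0.05],
    "f5": [0.0, 0.0, 1.0],
}
candidate_clones = cluster_clones(embeddings)
```

In a joint statistical and formal pipeline of the kind the abstract describes, groups such as those in `candidate_clones` would only be candidates; a separate formal analysis would still have to confirm that the fragments in each group are valid clones.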
