BugGraph: Graph-Based Vulnerability Analysis for Binary Code with Syntax Similarity
Similarity detection, which determines whether two pieces of code are similar, is widely used for binary vulnerability detection. Existing approaches face two challenges in achieving high accuracy and coverage: compiler-induced and source-code-induced syntax similarity. Compiler-induced syntax similarity arises when the same source code is compiled with different toolchain configurations (e.g., compiler family, compiler version, and optimization level). Source-code-induced syntax similarity arises because the same vulnerability can exist not only in equivalent or highly similar code but also in less similar code. In this work, we design BugGraph, which uses several graph-based machine learning methods to accurately detect similar binary code. Specifically, we identify the compilation configuration (provenance) of a binary by constructing its function call graph and leveraging a graph attention network to highlight the functions that are most representative of the correct provenance. Further, we develop a generalized code similarity model that employs a graph-based triplet network to learn from a broad range of ranked similar code rather than from polarized pairs labeled only as similar or different. Experiments show that BugGraph achieves up to a 95% true positive rate (TPR) for binaries with both source-code-induced and compiler-induced variance. Our provenance identifier improves the TPR of the compared works by up to 39%. In addition, we have applied BugGraph to real-world vulnerability detection: it achieves 84% TPR on OpenSSL and finds 41 unpatched vulnerabilities in three firmware images.
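To illustrate the idea of learning from ranked rather than polarized similarity, the following is a minimal sketch of a standard triplet margin loss over fixed-length embeddings. It is not BugGraph's actual model: the function names (`euclidean`, `triplet_loss`), the margin value, and the toy two-dimensional vectors are assumptions made for illustration, and the graph-based step that would produce the embeddings from binary code is omitted.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, more_similar, less_similar, margin=1.0):
    """Standard triplet margin loss (illustrative, hypothetical helper).

    The loss is zero once the more similar sample is closer to the
    anchor than the less similar sample by at least `margin`.
    """
    d_pos = euclidean(anchor, more_similar)
    d_neg = euclidean(anchor, less_similar)
    return max(0.0, d_pos - d_neg + margin)

# Toy embeddings (assumed values, not derived from real binaries).
anchor = [0.0, 0.0]
close_code = [0.0, 1.0]   # ranked as more similar to the anchor
far_code = [3.0, 4.0]     # ranked as less similar to the anchor

well_ordered = triplet_loss(anchor, close_code, far_code)
mis_ordered = triplet_loss(anchor, far_code, close_code)
print(well_ordered, mis_ordered)
```

Because each triplet only states which of two samples is closer to the anchor, the same sample can act as the more similar one in one triplet and the less similar one in another, which is how a ranking over many degrees of similarity can be expressed.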