A Software Metrics Clustering Approach to Cross-Project Defect Prediction
The idea of predicting software defects for projects with little or no historical defect data by using other projects' software metrics and defect data has attracted great interest from both researchers and practitioners. However, the primary challenge in cross-project defect prediction (CPDP) is the heterogeneity between the training data and the target project (Cruz & Ochimizu, 2009; Nam, Pan, & Kim, 2013; Canfora et al., 2015; Zhang et al., 2016). Various prediction methods have been proposed and reviewed in the literature with a view to reducing this heterogeneity. One recognizable pattern in the research to date is that the complete training data is used as input, and correlated metrics are rarely mitigated when constructing prediction models (Tantithamthavorn et al., 2016, 2018; Jiarpakdee et al., 2019).

This praxis investigates the effects of correlated metrics on defect prediction models at different treatment threshold values and provides a guideline for treating redundant software metrics before prediction. Consistent with this concern, the empirical findings suggest that defect predictors built on redundant software metrics may yield overly optimistic performance results. Furthermore, data exploration on the defect datasets confirms that the distribution of software metrics varies from project to project, making cross-project defect prediction even more challenging.

This praxis therefore proposes a novel cross-project defect prediction approach that applies a data transformation and clusters software modules based on their metric profiles. The empirical results, based on fifteen open-source software projects, indicate that the proposed method is a promising direction for reducing the heterogeneity between the training data and the target project.
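The abstract does not specify how correlated metrics are treated, so the following is only a minimal sketch of one common approach: greedily dropping any metric whose pairwise correlation with an already-kept metric exceeds a threshold. The metric names, the Spearman correlation, and the 0.7 threshold are illustrative assumptions, not the praxis's actual procedure.

```python
import numpy as np
import pandas as pd

def drop_correlated_metrics(df, threshold=0.7):
    """Keep metrics in column order, dropping any whose absolute pairwise
    Spearman correlation with an already-kept metric exceeds `threshold`."""
    corr = df.corr(method="spearman").abs()
    kept = []
    for col in df.columns:
        if all(corr.loc[col, k] <= threshold for k in kept):
            kept.append(col)
    return df[kept]

# toy metric table: `loc_copy` is redundant with `loc` by construction
rng = np.random.default_rng(0)
loc = rng.integers(10, 500, size=100).astype(float)
metrics = pd.DataFrame({
    "loc": loc,
    "loc_copy": loc * 1.01,  # near-duplicate metric
    "cyclomatic": rng.integers(1, 20, size=100).astype(float),
})
reduced = drop_correlated_metrics(metrics, threshold=0.7)
print(list(reduced.columns))  # the redundant near-duplicate is dropped
```

Varying `threshold` is what a sensitivity study like the one described would sweep: a stricter (lower) threshold removes more metrics, trading information for reduced redundancy.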
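The proposed approach combines a data transformation with clustering of modules by their metric profiles; the abstract names neither the transformation nor the clustering algorithm. As a hedged stand-in, the sketch below uses a log transform with z-scoring and a plain Lloyd's k-means, on synthetic data whose two "projects" differ sharply in raw metric scale. All of these choices are assumptions for illustration only.

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd's algorithm: assign each point to its nearest centroid,
    then refit centroids; an illustrative stand-in for the clustering step."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels

# toy modules from two projects whose raw metric scales differ sharply
rng = np.random.default_rng(1)
project_a = rng.lognormal(mean=2.0, sigma=0.5, size=(60, 3))
project_b = rng.lognormal(mean=4.0, sigma=0.5, size=(60, 3))
X = np.vstack([project_a, project_b])

# the log transform tames the heavy skew typical of size/complexity metrics,
# and z-scoring puts every metric on a comparable scale
X_t = np.log1p(X)
X_t = (X_t - X_t.mean(axis=0)) / X_t.std(axis=0)

labels = kmeans(X_t, k=2)  # group modules by metric profile, not by project
```

Training and predicting within a cluster of similarly-profiled modules, rather than across whole projects, is one way such clustering could reduce the train/target heterogeneity the abstract describes.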