Optimal Subset Selection for Causal Inference Using Machine Learning and Particle Swarm Optimization
There has been a notable rise in the availability of non-experimental observational data sets, including government survey data and “big data” from continuous monitoring. To date, most of these data have spurred correlational studies, and there is growing interest in drawing causal inferences from them. The goal of this study is to propose and evaluate a method for optimally constructing synthetic treatment and control samples for the purpose of drawing causal inferences. A method of balancing data sets to remove bias, using machine learning as a two-sample test, is proposed and validated. The study builds on the balance optimization subset selection (BOSS) problem, a new area of study in operations research. This problem formulation minimizes aggregate imbalance in covariate distributions to reduce bias in the data. The cross-validated area under the receiver operating characteristic curve (AUC) is proposed as a measure of balance between treatment and control groups: an AUC near 0.50 means a classifier cannot distinguish the two groups from their covariates. The proposed approach provides direct and automatic balancing of covariate distributions. In addition, the AUC-based approach is able to detect subtler distributional differences than existing measures, such as simple empirical mean/variance and count-based metrics. Optimizing AUC therefore achieves better balance. Using 5 widely used real data sets and 7 synthetic data sets, it is shown that optimizing samples with existing measures (chi-square, mean and variance differences, Kolmogorov-Smirnov, and Mahalanobis) yields samples that still contain imbalance detectable by machine learning algorithms. Minimizing covariate imbalance by minimizing the absolute distance from 0.50 of the maximum cross-validated AUC over M folds, using evolutionary optimization, is found to be effective. Particle swarm optimization (PSO) outperforms modified cuckoo swarm (MCS) on this proposed gradient-free, non-linear, noisy cost function.
To compute AUCs, supervised binary classification approaches from the machine learning and credit scoring literature are used.
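The AUC-based balance measure described above can be illustrated with a minimal sketch. It is not the study's implementation. A plain gradient-descent logistic regression stands in for the classifiers the study draws from the machine learning and credit scoring literature, and the function names (`auc`, `fit_logistic`, `cross_validated_aucs`, `imbalance`, `make_groups`), the toy Gaussian data, and the reading of the cost as the largest fold-wise distance of the AUC from 0.50 are all assumptions made for illustration. The subset search itself (PSO or MCS) is not shown.

```python
import math
import random


def auc(scores, labels):
    """AUC via the Mann-Whitney rank statistic; tied scores share an average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1  # 1-based average rank
        start = end + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def linear_score(w, b, row):
    """Linear predictor; monotone in the predicted probability, so enough for AUC."""
    return b + sum(wi * xi for wi, xi in zip(w, row))


def fit_logistic(X, y, epochs=100, lr=0.1):
    """Batch gradient-descent logistic regression (assumed stand-in classifier)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for row, target in zip(X, y):
            z = max(-30.0, min(30.0, linear_score(w, b, row)))  # avoid overflow
            err = 1.0 / (1.0 + math.exp(-z)) - target
            grad_b += err
            for j, xj in enumerate(row):
                grad_w[j] += err * xj
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def cross_validated_aucs(X, y, n_folds=5, seed=0):
    """Held-out AUC per fold for predicting group membership (1 = treatment)."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for label in (0, 1):  # stratify so every fold contains both groups
        members = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(members)
        for pos, i in enumerate(members):
            folds[pos % n_folds].append(i)
    aucs = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        w, b = fit_logistic([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores = [linear_score(w, b, X[i]) for i in test_idx]
        aucs.append(auc(scores, [y[i] for i in test_idx]))
    return aucs


def imbalance(X, y, n_folds=5, seed=0):
    """Largest fold-wise distance of the AUC from 0.5 (assumed form of the cost)."""
    return max(abs(a - 0.5) for a in cross_validated_aucs(X, y, n_folds, seed))


def make_groups(shift, n_per_group=150, seed=1):
    """Toy data: control ~ N(0, 1), treatment ~ N(shift, 1), two covariates."""
    rng = random.Random(seed)
    control = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(n_per_group)]
    treated = [[rng.gauss(shift, 1), rng.gauss(shift, 1)] for _ in range(n_per_group)]
    return control + treated, [0] * n_per_group + [1] * n_per_group


if __name__ == "__main__":
    for shift in (0.0, 2.0):
        X, y = make_groups(shift)
        print(f"shift={shift}: imbalance={imbalance(X, y):.3f}")
```

In this sketch, a candidate treatment/control subset would be scored by calling `imbalance` on its covariates and group labels, and a gradient-free optimizer would search for the subset with the lowest score. Because each evaluation depends on the fold assignment and on a fitted classifier, the score is noisy, which is consistent with the abstract's description of the cost function.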