Concordant Integrative Analysis of Multiple Gene Expression Data Sets
A microarray is an experimental platform on which tens of thousands of genes can be printed on a small chip, which makes it possible to measure genome-wide expression profiles. Because the cost of a microarray experiment is still relatively high, sample sizes are usually small. For some important diseases, microarray data have been collected by several laboratories, and we expect more efficient analysis results if data sets collected for the same or similar studies can be integrated. However, because of many complicated experimental issues, the genome-wide concordance among these data sets must be evaluated before they are analyzed together. If the underlying behavior of a gene is consistent across experiments, its expression profiles in the different data sets will be concordant. Statistically, mixture models are widely used to accommodate unobserved heterogeneity in a study population. A mixture-model-based method has been proposed for integrative concordance analysis when two microarray data sets are available, and this approach needs to be extended to multiple data sets.

The general statistical framework for our integrative analysis is the partial concordance/discordance (PCD) model. The difficulty in estimating it is that its parameter space grows exponentially with the number of data sets. Since the complete concordance (CC) model and the complete independence (CI) model are two basic statistical frameworks that can be derived from the PCD model, we propose a two-level mixture model to approximate the PCD model. It combines the basic CC and CI models, and its parameter space grows linearly with the number of data sets. We have implemented an expectation-maximization algorithm to estimate the model parameters.
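To make the contrast between exponential and linear growth concrete, the following minimal sketch counts free mixing proportions under each model and builds the joint state probabilities implied by a CC/CI mixture. It rests on assumptions that the abstract does not state: each gene is assumed to take one of three latent states in each data set, parameters of the component densities are ignored, and all function names are hypothetical. It is an illustration of the idea, not the implementation used in this work.

```python
from itertools import product


def pcd_free_proportions(n_datasets, n_states=3):
    """Free proportions when every joint state configuration has its own weight.

    Assumption: n_states latent states per gene per data set (3 is illustrative).
    """
    # One proportion per configuration, minus one for the sum-to-one constraint.
    return n_states ** n_datasets - 1


def two_level_free_proportions(n_datasets, n_states=3):
    """Free proportions for a mixture of a CC part and a CI part (assumed form)."""
    cc = n_states - 1                 # one shared state distribution
    ci = n_datasets * (n_states - 1)  # one state distribution per data set
    return 1 + cc + ci                # plus the top-level CC-versus-CI weight


def two_level_joint_probs(lam, cc_props, ci_props):
    """Joint state probabilities implied by lam * CC + (1 - lam) * CI."""
    n_states = len(cc_props)
    n_datasets = len(ci_props)
    probs = {}
    for config in product(range(n_states), repeat=n_datasets):
        # CI part: states are drawn independently in each data set.
        ci_term = 1.0
        for k, state in enumerate(config):
            ci_term *= ci_props[k][state]
        # CC part: mass only on configurations with the same state everywhere.
        cc_term = cc_props[config[0]] if len(set(config)) == 1 else 0.0
        probs[config] = lam * cc_term + (1.0 - lam) * ci_term
    return probs


if __name__ == "__main__":
    for k in (2, 3, 5, 10):
        print(k, pcd_free_proportions(k), two_level_free_proportions(k))
```

Under these assumptions, five data sets require 242 free proportions in the unrestricted model but only 13 in the two-level approximation, and the approximation still assigns a valid probability to every joint configuration.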
Simulation studies have been conducted to understand the performance of our method. We have also applied our method to a collection of microarray gene expression data sets from a lung cancer study.

Furthermore, we have developed other approaches that reduce the parameter space of the PCD model by simplifying the non-diagonal proportion parameters. These approaches are inspired by the exchangeable and AR(1) correlation structures used in generalized estimating equations (GEE) and by the multiset coefficient in combinatorics. Model fitting again uses an expectation-maximization algorithm. The performance of the proposed methods is examined in simulation studies. We have also compared these methods with the two-level mixture model based method by applying them to the same experimental data sets from the lung cancer study.
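The multiset coefficient mentioned above counts the ways to choose k items from n types when order does not matter and repetition is allowed; it equals C(n + k - 1, k). The sketch below computes it and checks it against brute-force enumeration. How the coefficient enters the simplified PCD model is not described in this abstract, so the reading used here (a joint configuration is identified only by which states occur and how often, not by which data set shows which state) is an assumption, as are the three-state setting and the function names.

```python
from itertools import product
from math import comb


def multiset_coefficient(n_states, n_datasets):
    """Number of multisets of size n_datasets drawn from n_states types."""
    return comb(n_states + n_datasets - 1, n_datasets)


def count_unordered_configs(n_states, n_datasets):
    """Brute-force check: distinct configurations once data set order is ignored."""
    ordered = product(range(n_states), repeat=n_datasets)
    return len({tuple(sorted(config)) for config in ordered})


if __name__ == "__main__":
    n_states = 3  # assumed number of latent states, for illustration only
    for k in (2, 3, 5):
        ordered_total = n_states ** k
        unordered_total = multiset_coefficient(n_states, k)
        # The closed form agrees with direct enumeration.
        assert unordered_total == count_unordered_configs(n_states, k)
        print(k, ordered_total, unordered_total)
```

With three states and five data sets, the 243 ordered configurations collapse to 21 unordered ones, which shows how a symmetry assumption of this kind can replace exponential growth with polynomial growth.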