Modeling the Correlation Structure of RNA Sequencing Data Using A Multivariate Poisson-Lognormal Model Open Access
High-throughput sequencing technologies are widely used in biomedical research, especially in human genomic studies. RNA sequencing (RNA-seq) applies these technologies to quantify gene expression, study alternatively spliced genes, and discover novel isoforms. Poisson-based methods are commonly used to model RNA-seq data in practice, and differential expression analysis of RNA-seq data has been well studied. The correlation structure of RNA-seq data, however, has not been studied extensively.

This dissertation proposes a multivariate Poisson-lognormal model for the correlation structure of RNA-seq data. The approach allows both positive and negative correlations to be estimated for count-type RNA-seq data. Three general scenarios are discussed. In scenario 1, one exon with one isoform, we propose a bivariate Poisson-lognormal model. In scenario 2, multiple exons with one isoform, we propose a multivariate Poisson-lognormal model. Because the number of pairwise correlations grows with the number of exons, we introduce a block compound symmetry correlation structure to reduce the parameter space. In scenario 3, multiple exons with multiple isoforms, we propose a mixture of multivariate Poisson-lognormal models. Correlation coefficients are estimated by the method of moments. At the multiple-exon level, we apply an average weighting strategy to reduce the number of moment equations. Simulation studies demonstrate the advantage of our moment estimator of the correlation coefficient over the Pearson correlation coefficient estimator and Spearman's rank correlation coefficient estimator.

To illustrate the application, we apply our methods to RNA-seq data from The Cancer Genome Atlas (TCGA) breast cancer study. We estimate the correlation coefficient between the genes TP53 and CDKN1A in normal subjects. The results show that TP53 and CDKN1A are slightly negatively correlated.
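
As a rough illustration of the bivariate case in scenario 1, the sketch below simulates counts whose log-rates are bivariate normal and recovers the latent correlation from sample moments. It is not taken from the dissertation: it relies only on the standard Poisson-lognormal moment identities, E[Y_i] = exp(mu_i + s_ii/2), Var(Y_i) = E[Y_i] + E[Y_i]^2 (exp(s_ii) - 1), and Cov(Y_1, Y_2) = E[Y_1] E[Y_2] (exp(s_12) - 1). The function names, parameter values, and sample size are hypothetical choices made for the example, and the estimator used in the dissertation may differ in detail.

```python
import numpy as np


def simulate_bivariate_pln(n, mu, sigma, rng):
    """Draw n count pairs: log-rates are bivariate normal, counts are Poisson."""
    log_rates = rng.multivariate_normal(mu, sigma, size=n)
    return rng.poisson(np.exp(log_rates))


def moment_correlation(counts):
    """Method-of-moments estimate of the latent (log-scale) correlation.

    Inverts the Poisson-lognormal moment identities. Returns nan when a
    sample variance does not exceed its sample mean (no overdispersion),
    because the latent variance estimate is then not positive.
    """
    m = counts.mean(axis=0)
    cov = np.cov(counts, rowvar=False)
    # Latent variances, from Var(Y_i) = m_i + m_i^2 * (exp(s_ii) - 1)
    s11 = np.log1p((cov[0, 0] - m[0]) / m[0] ** 2)
    s22 = np.log1p((cov[1, 1] - m[1]) / m[1] ** 2)
    # Latent covariance, from Cov(Y_1, Y_2) = m_1 * m_2 * (exp(s_12) - 1)
    s12 = np.log1p(cov[0, 1] / (m[0] * m[1]))
    if s11 <= 0 or s22 <= 0:
        return float("nan")
    return float(s12 / np.sqrt(s11 * s22))


# Hypothetical parameters: latent correlation -0.15 / 0.25 = -0.6
rng = np.random.default_rng(0)
mu = [2.0, 2.0]
sigma = [[0.25, -0.15], [-0.15, 0.25]]
counts = simulate_bivariate_pln(200_000, mu, sigma, rng)

print("moment estimate:", moment_correlation(counts))
print("Pearson on raw counts:", np.corrcoef(counts, rowvar=False)[0, 1])
```

Under these assumed parameters, the Pearson correlation of the raw counts is pulled toward zero by the Poisson sampling noise, while the moment estimator targets the correlation on the latent log scale. This is one way to see why a negative latent correlation can still be recovered from count data.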