Mixture Models for Left- and Interval-Censored Data and Concordance Indices for Composite Survival Outcomes Open Access
For cost-effectiveness and efficiency, many large-scale general-purpose cohort studies are being assembled within health-care providers that use electronic health records. Two key features of such data are that incident disease is interval-censored between irregular visits and pre-existing (prevalent) disease is left-censored. Because prevalent disease is not always immediately diagnosed, some disease diagnosed at later visits is actually undiagnosed prevalent disease. I consider prevalent disease as a point mass at time zero for clinical applications where there is no interest in the time of prevalent disease onset. I demonstrate that the naive Kaplan-Meier cumulative risk estimator underestimates risks at early time points and overestimates later risks. I propose a general family of mixture models that I call prevalence-incidence models. Parameters for parametric prevalence-incidence models, such as the logistic regression and Weibull survival (logistic-Weibull) model, are estimated by direct likelihood maximization or by the EM algorithm. Non-parametric methods are proposed to calculate cumulative risks for cases without covariates. I compare naive Kaplan-Meier, logistic-Weibull, and non-parametric estimates of cumulative risk in the cervical cancer screening program at Kaiser Permanente Northern California. Kaplan-Meier methods provided poor estimates, while the logistic-Weibull estimates closely matched the non-parametric estimates. My findings support the use of logistic-Weibull models over Kaplan-Meier methods to develop risk estimates for informing U.S. risk-based cervical cancer screening guidelines.

Harrell's c index is widely used to measure accuracy in predicting univariate survival outcomes. However, survival outcomes relating to a disease of interest may show up in multiple endpoints of interest. I propose two extensions of Harrell's c index for composite survival outcomes that account for the frequencies of occurrence and the severity/importance of the outcomes.
A weighted C index is proposed for a disease process with multiple equally important endpoints, and a most severe comparable C index is proposed for a disease process with a rare primary outcome and a correlated secondary outcome. Asymptotic properties are derived based on theorems for U-statistics. In simulation studies, my extensions gain efficiency and power in identifying true prognostic variables. I illustrate these novel concordance indices using the Epidemiology of Diabetes Intervention and Complications (EDIC) and the Diabetes Prevention Program (DPP) trials. In EDIC, the prognosis of diabetes patients at risk for multiple equally important microvascular complications is evaluated using the weighted C index. In DPP, patients with impaired glucose resistance (IGR) may either progress to type II diabetes or regress to normal glucose resistance (NGR). The proposed most severe comparable index better evaluates the accuracy in predicting diabetes risk with the help of auxiliary NGR outcomes.
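
For readers unfamiliar with the baseline being extended, the following is a minimal sketch of the standard univariate Harrell's c index for right-censored data. It is not the weighted or most severe comparable index proposed here. The function name `harrell_c` and its arguments are hypothetical, and the handling of tied event times (such pairs are simply skipped) is a simplifying assumption.

```python
from itertools import combinations


def harrell_c(times, events, risk_scores):
    """Standard Harrell's c for a single right-censored outcome.

    times: observed follow-up times
    events: 1/True if the event was observed, 0/False if censored
    risk_scores: predicted risk (higher score = expected earlier event)

    Assumption for this sketch: pairs with tied observed times are skipped.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue  # simplification: ignore tied times
        # Let `a` be the subject with the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        # A pair is comparable only if the shorter time is an observed event.
        if not events[a]:
            continue
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # higher risk failed first: concordant
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    t = [2, 4, 6, 8]
    d = [1, 1, 0, 1]
    score = [0.9, 0.7, 0.4, 0.1]
    print(harrell_c(t, d, score))
```

A value of 0.5 corresponds to predictions no better than chance and 1.0 to perfect ranking of the comparable pairs. Because censored subjects with the shorter follow-up time contribute no comparable pairs, a rare primary outcome leaves few pairs to compare, which is the situation the composite-outcome extensions are meant to address.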