nif:isString
|
-
To illustrate different prediction analytic approaches, we focused on SUD outpatient treatment for Hispanics during adulthood using publicly available TEDS-D 2006–2011 data [17]. TEDS-D is an excellent example of a large administrative dataset that may be of interest to addiction researchers in real life. This dataset allows us to illustrate the use of the methodologies introduced in this work within a realistic setting. Treatment completion was considered a successful treatment discharge status; all other treatment discharge reasons (e.g., left against professional advice, incarcerated, other) were considered indicators of non-successful treatment. Treatment completion is a standard process outcome measure because it predicts longer-term outcomes such as less future criminal involvement, fewer readmissions, and employment and income one year after treatment [19–22].
Twenty-eight predictors recorded by TEDS-D were included in the analysis. Predictors include 10 patient characteristics (i.e., age, gender, race [e.g., White, Black], ethnicity [indicating the patient’s specific Hispanic origin], marital status, education, employment status, pregnant at time of admission, veteran status, and living arrangement), 3 treatment characteristics (i.e., intensity, medication-assisted opioid therapy, and length of stay), principal source of referral, summary of type of problematic substance (with categories “alcohol only”, “other drugs only”, or “alcohol and drugs”), and mental health problem. TEDS-D records thorough information about substances of misuse. This includes the following 12 predictors: primary, secondary, and tertiary substance problem, each with its usual route of administration, frequency of use, and age at first use. The substances include alcohol, cocaine/crack, marijuana/hashish, prescription opiates/synthetics, and methamphetamine. Several other drug use categories were collapsed for analysis because of low percentages.
TEDS-D includes all admissions/discharges rather than individuals. Consequently, only records that indicate the individual had no prior SUD treatment were included in the analyses. Part of our previous work focuses on racial and ethnic minorities [11, 12, 23, 24]. It is known that racial and ethnic minorities vary in their treatment access and success levels [25–27]. Thus, we restricted our analysis to cases indicating a Hispanic/Latino ethnicity, an age of 18 years or older, and treatment in outpatient service settings. We focused on outpatient service settings because the criteria for treatment completion/success (i.e., the outcome of interest) and the treatment duration often differ substantially between outpatient services and other types of services (e.g., 24-hour inpatient, detoxification-only). This sample was chosen arbitrarily to exemplify the different analytical approaches. The choice is based on our previous knowledge in this field and not on the results obtained after using the methods described in the following paragraphs.
Records with missing data in any of the predictors, the outcome, or characteristics used for determining inclusion in the study were excluded from the analysis. Since not all states collect the same information for their patients, from a total of 9,829,536 records, 4,385,825 (45%) had all required data complete. These inclusion and exclusion criteria did not affect the performance of SL when compared to the rest of the analytical strategies used. A total of 99,013 records representing unique individuals were selected according to the inclusion and exclusion criteria and were used in the analysis. This sample was split into a training set with 80% of the sample (n = 79,210) and a test set with the remaining 20% (n = 19,802). Because these data represent public information and there is no subject identification, the University of Iowa Human Subjects Office Institutional Review Board exempted this study from review.
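A minimal sketch of the 80/20 random split described above, written in Python with only the standard library. The function name `split_train_test`, the seed, and the toy record IDs are illustrative assumptions; the original analysis was run in R with H2O and its exact splitting procedure is not given in the text.

```python
import random

def split_train_test(record_ids, train_fraction=0.8, seed=2017):
    """Randomly partition record IDs into a training set and a test set."""
    ids = list(record_ids)
    rng = random.Random(seed)          # fixed seed (an assumption) for reproducibility
    rng.shuffle(ids)
    n_train = int(round(train_fraction * len(ids)))
    return ids[:n_train], ids[n_train:]

# Toy example with 1,000 hypothetical record IDs.
train_ids, test_ids = split_train_test(range(1000))
print(len(train_ids), len(test_ids))
```

Every record lands in exactly one of the two sets, so the test set plays no role in model fitting.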
The goal of the analysis was to compare results of classical analytical strategies that are commonly used to address prediction of treatment success (e.g., logistic regression) alongside results of newer methods for prediction (e.g., ANN). We also compared SL results. SL is an ensemble machine learning methodology that encompasses all other methodologies in its library and has demonstrated both theoretical and empirical superiority for prediction [8, 9]. Formally, the analysis is as follows. Let $W = \{W_1, \ldots, W_k\}$ denote the predictors of interest and $Y$ represent the binary outcome treatment success (Yes/No). Let $O = (W, Y)$ be a random variable such that $O \sim P_0$ (i.e., the true probability distribution of $O$ is $P_0$). Since only records that indicate the individual had no prior treatment in a drug or alcohol program were included in the analyses, we assumed that the individuals observed can be represented as independent and identically distributed observations of the random variable $O$. For each individual $i$, $Y_i$ and $W_i$ are observed. We sought to estimate $\bar{Q}_0 = P_0(Y = \text{Yes} \mid W)$, i.e., the probability of succeeding in treatment given the predictors of interest. $\bar{Q}_0$ is unknown. Hence, we aimed to find the best estimator of $\bar{Q}_0$. This was achieved by maximizing the AUC. We did this using logistic regression, 3 types of penalized regression, RF, ANN, and the corresponding SL that includes these algorithms in its library. SL also maximized the AUC. We used 2 mechanisms to avoid overfitting. The analysis used 2-fold cross-validation (CV) on 80% of the data set (n = 79,210), and the final model was validated using 19,802 randomly selected individuals that were only included in the test set (Fig 1). All analytic approaches explored were compared using the AUC in the test set. The best analytic strategy for predicting treatment success was identified as the model maximizing AUC.
The AUC is a convenient measure of prediction success that can be interpreted as the probability that an algorithm ranks a randomly chosen successful patient higher than a randomly chosen unsuccessful patient [28]. AUC ranges from 0 to 1; AUC = 1 means perfect prediction and AUC = 0.5 suggests chance levels of prediction.
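The probabilistic reading of the AUC can be made concrete with a short sketch. The function below counts, over all (successful, unsuccessful) pairs, how often the successful patient receives the higher predicted score, with ties counted as one half. The function name `auc` and the toy scores are illustrative assumptions, not values from the study.

```python
def auc(scores_success, scores_failure):
    """Share of (success, failure) pairs in which the success has the higher score.

    Ties contribute 1/2, which matches the Mann-Whitney form of the AUC.
    """
    wins = 0.0
    for s in scores_success:
        for f in scores_failure:
            if s > f:
                wins += 1.0
            elif s == f:
                wins += 0.5
    return wins / (len(scores_success) * len(scores_failure))

# Hypothetical predicted probabilities of treatment completion.
completed = [0.9, 0.8, 0.4]
not_completed = [0.7, 0.3, 0.2]
print(auc(completed, not_completed))  # 8 of the 9 pairs are ranked correctly
```

This pairwise loop is quadratic in the sample size; for data of the size used here, a rank-based computation would be the practical choice.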
Fig 1. Analytic work flow. (Figure data removed from full text; figure identifier: 10.1371/journal.pone.0175383.g001.)
We used at least 2 different configurations for each analytic strategy compared. The first approach included all 28 predictors available. Since it can be useful to reduce the number of predictors considered and simplify the final prediction formula, we also selected a subset of the 10 predictors with the highest variable importance in an RF model. In decreasing order of importance, these predictors were: length of stay, age, principal source of referral, primary problematic substance, its age at first use, its frequency of use, employment level, SUD type (i.e., only alcohol, only drugs, or both alcohol and drugs), education level, and patient’s specific Hispanic origin (e.g., Puerto Rican, Mexican, Cuban). For each set of predictors, the following algorithms were compared: logistic regression, least absolute shrinkage and selection operator (lasso), ridge, and elastic net penalized regressions [29], RF [30], ANN [29, 31], and SL [8]. Since none of the aforementioned regression models considered interaction effects in the predictive model, an additional model including terms for all predictors and selected 2-way interactions was also added to the SL library for each type of regression. Two-way interactions were initially screened using all possible logistic regression models of 2 predictors at a time and their 2-way interaction. All interactions with p-values < 0.0001 were included along with all predictors in the regression models of the SL library. In total, we compared 17 algorithms/algorithm configurations. There is a multitude of other analytic strategies that could be chosen [29]; however, we consider the chosen algorithms adequate to illustrate the use of SL for predicting treatment success in comparison with other methods.
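One reading of how the 17 configurations add up is sketched below: 6 base algorithms on each of the 2 predictor sets, 4 regressions with screened interactions, and a single SL. This breakdown is our interpretation of the counts reported in the text (17 in total, 4 of them interaction models), not a list given by the authors, and the labels are illustrative.

```python
# Base algorithms fitted on each predictor set.
base_algorithms = ["logistic", "lasso", "ridge", "elastic net", "RF", "ANN"]
predictor_sets = ["all 28 predictors", "top 10 predictors"]
# Regression types that also get an interaction model.
regressions = ["logistic", "lasso", "ridge", "elastic net"]

configurations = [(a, p) for a in base_algorithms for p in predictor_sets]
configurations += [(r, "all predictors + screened 2-way interactions") for r in regressions]
configurations.append(("SL", "ensemble of the library above"))

print(len(configurations))  # 12 + 4 + 1 = 17
```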
In this context, where the data generating model is unknown, SL is expected to perform asymptotically at least as well as the best of the other methodologies in its library, regardless of the set of algorithms initially chosen. Some characteristics of the chosen algorithms are presented subsequently. Parametric logistic regression is the most commonly used algorithm for prediction in the SUD treatment outcomes field. Logistic regression is easily implemented and interpreted. However, logistic regression assumptions are strong and, since the true data generating model is unknown, these assumptions are usually violated. For example, including many predictors, their interactions, and/or other higher order terms in a logistic regression model does not guarantee that it is the best model, because of collinearity between the predictors included (which increases variability) or model misspecification (which introduces bias).
Penalized regression (such as lasso, ridge regression, or elastic nets) offers an alternative to parametric regression models. The 3 types of penalized regression applied in this work vary in their variance/bias tradeoffs depending on the characteristics of the predictor set. For example, lasso tends to select only one term from a set of correlated predictors. This may not be appropriate. In fact, when the number of predictors is small compared to the number of independent observations, ridge regression outperforms lasso when variables are highly correlated. Additionally, if the true data generating model has only a few predictors but the candidate models have a large number of predictors, lasso may eliminate predictors that were not in the data generating model. On the other hand, ridge regression will keep all terms in the final model. The elastic net penalized regression provides some balance between lasso and ridge regression. In the example presented here, there are 28 categorical predictors that correspond to 135 terms when the model is parametrized using dummy variables. The screening step for the most relevant 2-way interactions of these 135 terms preselected 257 2-way interaction terms. Thus, the regression models including all predictors and selected interactions had 393 terms including the intercept (135 + 257 + 1). While logistic regression estimated 393 parameters for this model, lasso estimated parameters only for terms uncorrelated with each other and zeroed out the rest; ridge regression kept all 393 but down-weighted each term. In this way, penalized regression allows large models to be tuned, adapting them to the information provided by the data.
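For reference, the three penalties can be written in one common parametrization. The notation ($\beta$, $\lambda$, $\alpha$, $\ell$) is ours, and the text does not state which parametrization the software used, so this is a sketch under that assumption. Here $\ell(\beta)$ is the logistic log-likelihood, $n$ the number of training records, and $p$ the number of non-intercept terms ($p = 392$ in the largest model described above).

```latex
\hat{\beta} \;=\; \arg\min_{\beta}\;
\Big\{ -\tfrac{1}{n}\,\ell(\beta)
\;+\; \lambda \Big[\, \alpha \sum_{j=1}^{p} |\beta_j|
\;+\; \tfrac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^{2} \Big] \Big\},
\qquad \lambda \ge 0,\quad 0 \le \alpha \le 1 .
```

Setting $\alpha = 1$ gives lasso, whose absolute-value penalty can shrink coefficients exactly to zero; $\alpha = 0$ gives ridge, which shrinks all coefficients but keeps them in the model; intermediate values of $\alpha$ give the elastic net.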
RF is a recursive partitioning method popular in many fields with high-dimensional data (e.g., genomics). RF can evaluate many predictor variables even in the presence of complex interactions, including those that are not possible to model using regression. RF is an ensemble of classification and regression trees constructed on bootstrap samples. Compared with individual trees, RF is more protective against overfitting.
An ANN uses interconnected nodes within various layers to explain an outcome given a set of predictors. The relationships between the nodes are defined by weights calculated using a given rule. The initial weights are preassigned by the analyst. The ANN algorithm then iterates, adjusting the weights. At the end of each iteration, the predictive performance for the outcome is evaluated. ANNs efficiently generate non-linear classification rules but can be prone to overfitting. More recently, some types of ANN have been referred to as deep learning [31]. Deep learning allows modeling multiple levels of non-linearity in the data and is scalable to large datasets and big data in general. We used deep learning ANN with hidden layer sizes of 200.
SL is a generalization of the stacking algorithm [32], an ensemble machine learning method that takes a weighted average of all other algorithms considered for prediction and produces a single prediction function (PF) with an optimal tradeoff between variance and bias. SL is very flexible for learning from the data as it combines the strengths of all methodologies considered (including different configurations of the same algorithm) while minimizing modelling flaws. Another advantage of SL is that it eliminates the need to select a priori a single or a few methodologies for the analysis. SL allows the data to be analyzed simultaneously with all the methodologies the researcher considers suitable. Fig 1 depicts the work flow used to analyze the data and details all the steps necessary for running SL. The input of the analysis consists of the training and test data sets together with the algorithmic library. Since we used 2-fold CV to obtain each algorithm PF as well as the SL PF, the training set was initially partitioned into 2 blocks. Each algorithm in the library was fitted using each block independently. We used the data block excluded from the model fitting to calculate the CV AUC for each algorithm in the library. We averaged both CV AUCs, resulting in a single training set CV AUC for each algorithm. Up to this step, model fitting follows a regular 2-fold CV modelling path. The PF of discrete SL, a simpler version of SL, is the PF of the algorithm with the best CV performance, that is, the maximum CV AUC (equivalently, the minimum CV risk). However, SL performs better when a weighted combination of the algorithms’ PFs is used. Thus, the next step for obtaining the SL PF is to calculate a weight for each algorithm PF. This is done by regressing Y on the values of Y predicted by each algorithm in the library. Next, each algorithm is fitted using the whole training set and the SL PF is obtained by applying the estimated weights to the algorithm predictions for each observation.
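The workflow just described can be sketched end to end on toy data. Everything named here is an illustrative assumption: the two "algorithms" are simple group-rate predictors, the data are synthetic, and, most importantly, the weighting step uses a grid search over convex weights that maximizes the CV AUC as a stand-in for the regression of Y on the algorithms' predictions. It is not the h2oEnsemble implementation.

```python
def auc(y, p):
    """Mann-Whitney estimate of the AUC; ties count 1/2."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def group_mean_learner(column):
    """Toy algorithm: predict the training success rate within each level of one predictor."""
    def fit(rows):
        overall = sum(r["y"] for r in rows) / len(rows)
        totals = {}
        for r in rows:
            s, n = totals.get(r[column], (0, 0))
            totals[r[column]] = (s + r["y"], n + 1)
        rates = {k: s / n for k, (s, n) in totals.items()}
        return lambda r: rates.get(r[column], overall)   # unseen level -> overall rate
    return fit

# Hypothetical toy data: two categorical predictors and a binary outcome.
data = [{"w1": i % 4, "w2": i % 5,
         "y": int(3 * (i % 4) + 2 * (i % 5) + (7 * i) % 11 >= 9)} for i in range(250)]
train, test = data[:200], data[200:]
blocks = [train[:100], train[100:]]                      # 2-fold CV partition
library = {"by_w1": group_mean_learner("w1"), "by_w2": group_mean_learner("w2")}

# Step 1: cross-validated predictions (fit on one block, predict the other).
cv_pred = {name: [] for name in library}                 # cv_pred[name][k] <-> blocks[k]
for held_out in (0, 1):
    fit_rows, eval_rows = blocks[1 - held_out], blocks[held_out]
    for name, fit in library.items():
        pf = fit(fit_rows)
        cv_pred[name].append([pf(r) for r in eval_rows])

def cv_auc(fold_preds):
    """Average the AUC over the two held-out blocks."""
    return sum(auc([r["y"] for r in blocks[k]], fold_preds[k]) for k in (0, 1)) / 2

cv_aucs = {name: cv_auc(p) for name, p in cv_pred.items()}

# Step 2: discrete SL keeps the single algorithm with the best (maximum) CV AUC.
discrete_sl = max(cv_aucs, key=cv_aucs.get)

# Step 3: ensemble weights. ASSUMPTION: grid search over convex weights
# replaces the metalearning regression of Y on the predictions.
def blend(a, pa, pb):
    return [a * x + (1 - a) * z for x, z in zip(pa, pb)]

def blended_cv_auc(a):
    return cv_auc([blend(a, cv_pred["by_w1"][k], cv_pred["by_w2"][k]) for k in (0, 1)])

best_a = max([k / 10 for k in range(11)], key=blended_cv_auc)
weights = {"by_w1": best_a, "by_w2": 1 - best_a}
sl_cv_auc = blended_cv_auc(best_a)

# Step 4: refit every algorithm on the whole training set and apply the weights.
full_fits = {name: fit(train) for name, fit in library.items()}
sl_test_pred = [sum(weights[n] * full_fits[n](r) for n in library) for r in test]
test_auc = auc([r["y"] for r in test], sl_test_pred)
print(cv_aucs, discrete_sl, weights, round(test_auc, 3))
```

Because the weight grid includes 0 and 1, the blended CV AUC in this sketch can never fall below that of the discrete SL, which mirrors the motivation for using a weighted combination rather than a single selected algorithm.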
It can be demonstrated that, using 2-fold CV, the procedure can end here and the AUCs of fitting each algorithm and SL to the whole training dataset would suffice for SL to outperform the rest of the algorithms without overfitting. However, we included an additional validation step: all the PFs obtained with the training set were used to predict SUD treatment success in the test data set. We then calculated and compared the test set AUCs to evaluate all fitted models. A thorough description of all the algorithms used in this work is out of the scope of this manuscript. The interested reader will find further details about SL in van der Laan and Rose [33] and about the rest of the aforementioned methodologies in Friedman et al [29] and Bengio [31]. Models were fitted using the open source R programming language [34] and the H2O R interface version 3.8.2.2 [35], which optimizes all the analytical methods used for large datasets. The h2oEnsemble package version 0.1.8 [36] was used to fit the SL model. We set all tuning parameters for each algorithm in the SL library (e.g., the ANN implementation used in this manuscript has over 20 parameters) to their default values. The analysis took about 2.5 hours to run on a Windows 7 Professional desktop computer with an 8-core i7-3770 3.40 GHz CPU and 8 GB RAM. Most of the analytic time was used to fit the four regression models including 2-way interaction terms. The other 13 algorithms, including SL, required only about 6 minutes of the 2.5 hours. AUC confidence intervals and variances were estimated using the methodology of DeLong and colleagues [37] as implemented in the R package pROC [38]. Briefly, DeLong et al [37] used the equality between the AUC and the Mann-Whitney U statistic, together with asymptotic normality, to analytically derive variance and confidence interval estimators for the AUC.
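The DeLong approach can be sketched in a few lines for a single AUC. This is a plain-Python illustration of the estimator, not the pROC implementation; the function name `delong_auc_ci`, the default z value for a 95% interval, the clipping of the interval to [0, 1], and the toy scores are our assumptions.

```python
import math

def delong_auc_ci(pos, neg, z=1.96):
    """AUC with DeLong variance and a normal-approximation confidence interval.

    pos: predicted scores of successful cases; neg: scores of unsuccessful cases.
    Requires at least 2 observations in each group.
    """
    m, n = len(pos), len(neg)
    psi = lambda x, y: 1.0 if x > y else 0.5 if x == y else 0.0
    # Structural components: one per successful and one per unsuccessful case.
    v10 = [sum(psi(x, y) for y in neg) / n for x in pos]
    v01 = [sum(psi(x, y) for x in pos) / m for y in neg]
    auc = sum(v10) / m                      # equals the Mann-Whitney estimate
    s10 = sum((v - auc) ** 2 for v in v10) / (m - 1)
    s01 = sum((v - auc) ** 2 for v in v01) / (n - 1)
    var = s10 / m + s01 / n
    half = z * math.sqrt(var)
    # Clipping to [0, 1] is a presentation choice made in this sketch.
    return auc, var, (max(0.0, auc - half), min(1.0, auc + half))

# Hypothetical predicted probabilities of treatment completion.
auc_hat, var_hat, ci = delong_auc_ci([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])
print(auc_hat, var_hat, ci)
```

With only three cases per group the interval is very wide; with test sets of the size used here, the normal approximation is far more reasonable.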
|