nif:isString
|
-
In this paper, we used the cancer genomic and drug response data from the Cancer Cell Line Encyclopedia (CCLE). The CCLE dataset (http://www.broadinstitute.org/ccle) consists of large-scale genomics data, including expression profile of 20089 genes, mutation status of 1667 genes, copy number variation of 16045 genes for 947 human cancer cell lines, and 8-point dose-response curves for 24 chemical drugs across 479 cell lines. Drug sensitivities to a given cell line are evaluated by IC50, EC50, and activity area (the area over dose-response curves). In this study, activity area is used as a drug sensitivity measurement due to its ability to capture the efficacy and potency of drug sensitivity simultaneously compared to IC50 and EC50.
QRFs for 24 chemical compounds were trained based on three types of genomic features, including gene expressions, mutation status and copy number variation. Motivated by Riddick et al. [2], we designed a three-step quantile regression forest method as follows (Fig 2). In the first stage, some important genomic features were filtered through the correlation test. Then, a random forest was trained on the filtered features and variables are ranked and furtherly selected based on their importance. Finally, QRF is built using selected features.
Figure data removed from full text. Figure identifier and caption: 10.1371/journal.pone.0205155.g002 Workflow of the three-step quantile regression forest method.All features were screened by their Pearson correlations with drug response. Then a random forest was trained to rank selected features by their importance. The variables with the importance of twice standard deviation greater than the mean of importance were selected for the final quantile regression forest.
Feature screening by Pearson correlation coefficient: Due to the ultra-high dimension of genomic features in CCLE dataset, building QRFs directly from all available features is difficult and time-consuming. So a screening method was first applied to filter the potentially important features [16]. We calculated the Pearson correlation coefficients (PCCs) of genomic features with drug responses and ranked the importance of features by p-values of two-sided t-tests for PCCs. Features with p-value under 0.05 were then selected. For each drug in CCLE, the above feature screening process ranked the marginal importance of genomic features and selected around 2000 genes.
Variable selection by random forests: We next trained a random forest on the recruited genomic features and further selected a small subset of variables based on the generated variable importance. In detail, each tree in the random forest is a bootstrap sample from the original data, and some observations are not in the bootstrap sample, called the out-of-bag (OOB) cases. For each tree, the prediction performance (measure by residual sum of squares) on the OOB proportion of data is recorded. The same procedure is done after permuting the values of each variable. Decrease of the prediction performance after permutation averaged over all the trees is taken as a measurement of variable importance. In this stage, we trained 25000 trees for the random forest. The variables were selected if their importance values are 2 * SD above the mean of all variable importance values.
We first denote the τ-th quantile of Y given X = x by qτ(X|Y = x). For X = x, the conditional distribution function F(y|X = x) is the probability that Y is smaller than or equal to y ∈ R, i.e., F(y|X=x)=P(Y≤y|X=x). For a continuous distribution, the τ-th quantile qτ(X|Y = x) is defined as “y” such that F(y|X = x) = τ. While this definition cannot be extended to all cases, especially to discrete distributions, although the drug response (activity area) in this paper is continuous. In general, it is accepted that qτ(Y|X = x) = inf{y:F(y|X = x) ≥ τ}. Quantile regression forests (QRFs) [17] estimate the conditional quantiles of response (Y) given the features (X) by building random forests. To build a QRF, a set of T un-pruned regression trees are generated based on bootstrap sampling from the original data. In this paper, we used T = 15000. For each node of the regression trees, a random set of m features selected from the whole set of M features is used for fitting a regression tree based on the bootstrap samples. In this paper, m was taken as M/3 as suggested by Hastie et al. [18]. In the tree generating process, a node with less than 10 training samples is not portioned any more [17]. Then the conditional distribution is estimated by the weighted distribution of the observed response variables. More specifically, we consider the conditional distribution of Y given X = x based on the tree Ψ. Suppose the leaf which contains x is denoted by Ln(x,Ψ), then the weight ωi(x,Ψ) is given by ωi(x,Ψ)=I(Xi∈Ln(x,Ψ))#{j:Xj∈Ln(x,Ψ)}. Let the T trees of the random forests be Ψ1.…,ΨT, and ωi(x) be the average of ωi(x,Ψ) over all the trees. Then ωi(x)=1T∑t=1Tωi(x,Ψt). The estimated F^(y|X=x) is then given by F^(y|X=x)=∑i=1nωi(x)I(Yi≤y). (1) Then τ-th quantile qτ(x) is predicted by q^τ(Y|X=x)=inf{y:F^(y|X=x)≥τ}. On the other hand, based on the generated trees in QRF, if we make a small change to the right side of formula (1), i.e., ∑i=1nωi(x)Yi, then the mean (expectation) of Y given X = x is predicted. So besides the conditional quantiles, QRFs can easily predict the conditional mean of response Y. In this study, we implemented QRFs by R package “quantregForest” (version 0.2–3) and assessed the variable importance by permutation used in the original random forest algorithm.
The prediction intervals are constructed from the conditional quantiles of drug response predicted by QRFs. In detail, the (1 − α) × 100% prediction interval for drug response Y given genomic features X (a p-dimensional vector) is built by I(x) = [qα/2(Y|X = x), q1−α/2(Y|X = x)]. For example, the 95% prediction interval for drug response is estimated by I(x)=[q0.025(Y|X=x),q0.975(Y|X=x)]. It means that for a given x, drug response locates in the interval I(x) with high probability. The length of the prediction interval fluctuates with X.
Performance evaluation of prediction intervals: For each observation, we can obtain OOB prediction using the trees not containing this observation. OOB prediction is virtually equivalent to the prediction by cross validation when the number of trees is large [19]. In this study, we generated 15000 trees in QRFs and evaluated the prediction performance by QRFs using OOB prediction, avoiding the intensive computation of 10-fold cross validation [4]. For the point predictions of drug responses by QRFs, the Pearson correlation coefficients between the observed and predicted (OOB) values were used to quantify the accuracy. But the true conditional quantiles of drug responses are unobservable. So as suggested by Wang et al. [20], the prediction error of the τ-th conditional quantile was assessed based on the average of the value ρτ(Y−q^τ(Y|X=x)) over all observations, where ρτ(r) = τr − rI(r < 0) is called the quantile loss function, for 0 < τ < 1[10,20].
Homogeneity test of variances for drug responses: Given the prediction intervals, we next consider prioritizing patients with very close point predictions of drug response to the same drug. As aforementioned, a shorter prediction interval indicates the higher stability of prediction at the same confidence level. Thus, the length of prediction interval reflects the prediction reliability. In this circumstance, patients with longer prediction intervals need further consideration and expect other medical plans. The strategy is intuitive but not statistically rigorous, so we used the homogeneity test of variances as a complement. Note that for each drug, we have only one response value for a specific patient (cell line), thus do not have enough replicates to carry out the homogeneity test directly. However, taking advantage of our random forest model, we have estimated the quantiles of drug response qτ(Y|X = x) for each cell line by the quantile regression forest. In detail, for any τ between 0 and 1, we have obtained the estimate of qτ(Y|X = x). The quantile function of drug response equals to the inverse of cumulative distribution function since drug response is continuously distributed. This inspires us to use the inverse transform sampling to get the samples of drug responses [21]. In detail, for patient with geometric features X = x, we firstly generate different τ′s denoted as {τ1,..,τk} from the standard uniform distribution U[0,1]. Then the corresponding {qτ1(Y|X=x),…qτk(Y|X=x)} was taken as the random samples of drug responses for the considered patient. However, qτ(Y|X = x) is unknown, we thus treat {q^τ1(Y|X=x),…q^τk(Y|X=x)} as the unobserved {qτ1(Y|X=x),…,qτk(Y|X=x)} as the random samples and carry out the rest analyses. For simplicity, we denote q^τ(Y|X=x)asY* in the remaining content. Assume that patients 1 and 2 have almost the same point predictions for a certain drug. By the inverse transform sampling, after randomly sampling τij′s from U[0,1] (i = 1,2,j = 1,…,k), we get {Y11*,Y11*,……,Y1k*} and {Y21*,Y21*,……,Y2k*} for patients 1 and 2, respectively. Then we consider the following hypothesis problem H0:σ12=σ22v.s.H1:σ12≠σ22. Here σ12 and σ22 are the variances of drug responses for patients 1 and 2. We then use the Levene test [22], which is robust to the non-normal distributed samples, to test the difference of variances. Note that Levene test is not only restricted to two-group comparison, so we can also discuss the multi-patient test problem in the future. Similarly, the homogeneity test can also be used to compare different drugs for the same patient. In this paper, the significance level was set to 0.05. We tried different values for k, and did not observe significant changes at different values of k. So we finally took k as 500.
|