Partial least squares (PLS) regression is a method for modeling the relationship between two sets of multivariate data via a latent space and for performing least squares regression in that space. PLS can handle high-dimensional, collinear datasets because of its underlying assumption that the two datasets are generated by a small number of latent components. In this process, latent components are formed by maximizing the covariance between the two datasets.

Partial Least Squares Regression (PLS): PLS models a linear relationship between two blocks of variables {x_i}_{i=1}^n ∈ R^p and {y_i}_{i=1}^n ∈ R^q. In the following, X = (x_1, …, x_n)^T denotes the (n × p) predictor matrix and Y = (y_1, …, y_n)^T denotes the (n × q) response matrix. The procedure extracts L latent components, {t_i}_{i=1}^L and {u_i}_{i=1}^L, and assumes the following decomposition:

X = T P^T + F_x,    (1)
Y = U Q^T + F_y,    (2)

where T = (t_1, …, t_L) and U = (u_1, …, u_L) are the (n × L) matrices of L latent components corresponding to X and Y, respectively. The (p × L) matrix P and the (q × L) matrix Q are loadings, and the (n × p) matrix F_x and the (n × q) matrix F_y are matrices of residuals. Since our objective is to perform least squares regression in a low-dimensional latent space, the underlying assumption is that the latent component u_i can be well predicted from t_i through a relation such as

U = T D,    (3)

where D is an (L × L) matrix. To satisfy this assumption, we maximize the covariance between t_i and u_i; the objective criterion is

max_{t,u} cov(t, u) = max_{w,c} cov(Xw, Yc),    (4)

where w ∈ R^p and c ∈ R^q are weight vectors for projection onto the latent components. After extracting a latent component, the observation matrices X and Y are deflated by subtracting their rank-one approximations. It is important to stress the asymmetric scheme, i.e., that Y is deflated based on t, in the case of regression. By repeating the above procedure L times, we obtain the weight matrices W = (w_1, …, w_L) and C = (c_1, …, c_L). Finally, the relation in the original data space is expressed as

Y = X B + E,    (5)

where B is the (p × q) matrix of regression coefficients and E is the (n × q) matrix of residuals. Plugging the relationship B = W (P^T W)^{-1} C^T [27, 28] into Eq (5), we obtain a different representation of Y:

Ŷ = X B    (6)
  = X W (P^T W)^{-1} C^T    (7)
  = X X^T U (T^T X X^T U)^{-1} T^T Y.    (8)

The final transformation is derived from the following equalities [29]:

W = X^T U,    (9)
P = X^T T,    (10)
C = Y^T T.    (11)

Note that t_i^T t_j = δ_ij (the Kronecker delta), which takes the value 1 for i = j and 0 for i ≠ j, as a consequence of the algorithm. In general, B is obtained from a centered training dataset. The response y_new for a new subject x_new, referred to as the test dataset, is then estimated as

y_new = ȳ + B^T (x_new − x̄),    (12)

where x̄ and ȳ represent the mean predictor and mean response in the training dataset, respectively. A schematic outline of PLS is illustrated in Fig 1 and S1 Appendix.

[Fig 1 (10.1371/journal.pone.0179638.g001): Schematic illustration of partial least squares regression. Two blocks of data, X and Y, are projected by w and c onto latent components, t and u, and least squares regression is performed; p and q represent loading vectors.]
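To make the procedure concrete, the following is a minimal sketch of a NIPALS-style PLS fit in Python/NumPy, assuming centered arrays X (n × p) and Y (n × q). The function names (pls_fit, pls_predict) and the use of an SVD of the cross-covariance matrix to obtain the weight vectors of Eq (4) are illustrative choices, not the exact implementation used in this study.

```python
# Minimal sketch of linear PLS regression (Eqs (1)-(12)), assuming centered
# NumPy arrays X (n x p) and Y (n x q). Names follow the notation above.
import numpy as np

def pls_fit(X, Y, L):
    """Extract L latent components and return the (p x q) coefficient matrix B."""
    Xd, Yd = X.copy(), Y.copy()
    W, P, C = [], [], []
    for _ in range(L):
        # Weight vectors maximizing cov(Xw, Yc) (Eq 4): dominant singular
        # pair of the cross-covariance matrix Xd^T Yd.
        U_svd, _, Vt = np.linalg.svd(Xd.T @ Yd, full_matrices=False)
        w, c = U_svd[:, 0], Vt[0, :]   # X-weight w and Y-weight c (u = Yd @ c)
        t = Xd @ w
        t /= np.linalg.norm(t)         # scores normalized so that T^T T = I
        p = Xd.T @ t                   # predictor loadings (Eq 10)
        q = Yd.T @ t                   # response loadings (Eq 11), used for deflation
        Xd -= np.outer(t, p)           # deflate X by its rank-one approximation
        Yd -= np.outer(t, q)           # asymmetric scheme: Y deflated based on t
        W.append(w); P.append(p); C.append(q)
    W, P, C = np.array(W).T, np.array(P).T, np.array(C).T
    return W @ np.linalg.inv(P.T @ W) @ C.T   # B = W (P^T W)^{-1} C^T (Eq 7)

def pls_predict(B, x_new, x_bar, y_bar):
    """Eq (12): y_new = y_bar + B^T (x_new - x_bar), with training means x_bar, y_bar."""
    return y_bar + B.T @ (x_new - x_bar)
```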
Kernel Partial Least Squares Regression (KPLS): Linear PLS is easily extended to nonlinear regression using the kernel trick [28, 30]. Let ϕ: R^p → H be a nonlinear transformation of the predictor x ∈ R^p into a feature vector ϕ(x) ∈ H, where H is a high-dimensional feature space. Define the Gram matrix K of inner products of points in feature space, i.e., K = ΦΦ^T, where Φ = (ϕ(x_1), …, ϕ(x_n))^T represents the predictor matrix in feature space. In general, the number of columns of Φ is so large that we cannot carry out the same procedure as in the linear case with the explicit form of Φ. Owing to the kernel trick, however, the explicit form of Φ becomes unnecessary. The deflation procedure is performed as follows:

K ← (I_n − t t^T) K (I_n − t t^T),    (13)
Y ← Y − t t^T Y,    (14)

where I_n represents the n-dimensional identity matrix. We obtain the prediction on the training data from

Ŷ = Φ B    (15)
  = Φ Φ^T U (T^T Φ Φ^T U)^{-1} T^T Y    (16)
  = K U (T^T K U)^{-1} T^T Y.    (17)

To exclude the bias term, we assume that the responses and the predictors have zero mean in the feature space, which is achieved by applying the following procedure to the test kernel K_t and the training kernel K [31]:

K_t ← (K_t − (1/n) 1_{n_t} 1_n^T K)(I_n − (1/n) 1_n 1_n^T),    (18)
K ← (I_n − (1/n) 1_n 1_n^T) K (I_n − (1/n) 1_n 1_n^T),    (19)

where 1_n represents the n-dimensional vector whose elements are all 1, and n and n_t represent the numbers of training and test samples, respectively. In the following sections of this paper, we investigate three kernel functions: 1) a second-order polynomial kernel k(x, x′) = (x^T x′ + 1)^2, referred to as KPLS-Poly(2); 2) a third-order polynomial kernel k(x, x′) = (x^T x′ + 1)^3, referred to as KPLS-Poly(3); and 3) a Gaussian kernel k(x, x′) = exp(−γ ||x − x′||^2), referred to as KPLS-Gauss, where γ is a hyperparameter set to the inverse of the median Euclidean distance between data points.
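As an illustration of the kernel-side computations, the sketch below (assuming NumPy/SciPy) builds the Gaussian Gram matrix with γ set from the median pairwise distance, centers the training and test kernels as in Eqs (18) and (19), and forms the test-set analogue of Eq (17). The latent matrices T and U are assumed to come from a kernel PLS fitting routine such as that of [28], which is not shown here; the function names are illustrative.

```python
# Sketch of the kernel computations for KPLS-Gauss, assuming NumPy/SciPy.
# T (n x L) and U (n x L) are assumed to be produced by a kernel PLS routine.
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X1, X2, gamma):
    """k(x, x') = exp(-gamma * ||x - x'||^2)."""
    return np.exp(-gamma * cdist(X1, X2, "sqeuclidean"))

def median_gamma(X):
    """gamma = inverse of the median pairwise Euclidean distance of the data."""
    d = cdist(X, X, "euclidean")
    return 1.0 / np.median(d[np.triu_indices_from(d, k=1)])

def center_kernels(K, Kt):
    """Center training kernel K (n x n) and test kernel Kt (nt x n) in feature space (Eqs 18-19)."""
    n, nt = K.shape[0], Kt.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # I_n - (1/n) 1_n 1_n^T
    Kt_c = (Kt - (np.ones((nt, n)) / n) @ K) @ J   # Eq (18)
    K_c = J @ K @ J                                # Eq (19)
    return K_c, Kt_c

def kpls_predict(Kt_c, K_c, T, U, Y):
    """Test-set analogue of Eq (17): Y_hat = Kt U (T^T K U)^{-1} T^T Y, with Y the centered training responses."""
    return Kt_c @ U @ np.linalg.inv(T.T @ K_c @ U) @ T.T @ Y
```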
In addition to predicting clinical measures, our aim is to classify subjects into depressed patients and healthy controls using the predicted values of the clinical measures, for objective diagnosis. We evaluate the generalization of binary classifiers using linear discriminant analysis (LDA). Given the training data D_tr = {x_tr, y_tr, z_tr} and test data D_te = {x_te, y_te, z_te}, the variables x ∈ R^p, y ∈ R^q, and z ∈ {0, 1} represent functional connectivity as predictors, clinical measures as responses, and binary labels (i.e., 0 for patients and 1 for healthy controls), respectively. In the prediction phase, our objective is to learn the function f_B: R^p → R^q which, given predictors x_tr and responses y_tr, assigns predictors to the most probable values of y. The prediction on the training dataset is ŷ_tr = f_B(x_tr). In the subsequent classification phase, our objective is to learn the classifier g_w: R^q → {0, 1} which, given predicted responses ŷ_tr and binary labels z_tr, assigns predicted responses to the most probable labels. Assigned labels on the test dataset are obtained as ẑ_te = g_w(ŷ_te) = g_w(f_B(x_te)). It is important to stress that the binary classifier is trained not on the actual clinical measures y_tr but on the predicted values ŷ_tr. In a previous study [13], the authors only identified the binary classifier g_{w′}: R^p → {0, 1} which, given functional connectivity x_tr and binary labels z_tr, assigns functional connectivity directly to binary labels. By exploiting the predicted clinical measures, it may be possible to improve classification performance. We compared two scenarios: i) classification of patients and healthy controls by LDA applied to clinical measures predicted with KPLS (KPLS-Gauss, KPLS-Poly(3), and KPLS-Poly(2)), PLS, or ordinary least squares regression (OLS); and ii) classification of patients and healthy controls by LDA and SVM applied directly to functional connectivity.

Note that before scenario ii) we perform feature selection by calculating connection-wise t-tests to determine the connections with different group means, represented by t-scores; we select the M functional connections with the highest absolute t-scores, and M is optimized by cross-validation. Even though PLS can cope with high-dimensional, collinear datasets, we prescreened variables according to their relevance to the responses in the following way. Based on the Pearson correlation coefficients ρ_rl between the r-th functional connection and the l-th clinical measure, we define the empirical relevance of the r-th functional connection as

R_r = Σ_{l=1}^{4} ρ_{rl}^2,    r = 1, …, p,    (20)

where p is the total number of functional connections. The functional connections are ranked according to their empirical relevance {R_r}_{r=1}^p, and only the M most relevant functional connections are used in the following procedure. The optimal value of M was determined through nested leave-one-out cross-validation.

Conventionally, cross-validation is employed either to assess the generalization ability of a model or to select optimal parameters. Since we have to account for both generalization ability and parameter optimization, we used nested leave-one-out cross-validation (LOOCV), consisting of an outer and an inner LOOCV. The outer LOOCV repeatedly splits the whole set of samples into a single outer validation sample, used to evaluate generalization ability, and an outer training set, used for model estimation. The inner LOOCV is performed on the outer training set to optimize two parameters, M and L, the number of selected predictor variables and the number of latent components, respectively. The pair of parameters that achieves the lowest root mean squared error on the inner validation samples is adopted as optimal and used to evaluate the model in the outer LOOCV. These steps are repeated until each sample has served as the validation sample; a sketch of this procedure is given at the end of this section.

Age is significantly correlated with three of the clinical measures (Table 2). In general, age matching across diagnostic groups reduces the sample size, which degrades performance. To avoid this problem, we investigated three models: (i) a model with age as a response (denoted output-age), (ii) a model with age as a predictor (denoted input-age), and (iii) a model without age (denoted no-age). By incorporating age into the model, we can cope with age differences among subjects and evaluate prediction performance fairly.

Interpretation of each latent component projected from the input and output data gives novel insight into the relationship between functional connectivity and clinical measures. In the PLS framework, the loading matrices P and C indicate the contributions of the predictor variables and the response variables to each latent component (see Eqs (10) and (11)). The (i, j)-element of the loading matrix P represents the contribution of the i-th functional connection to the j-th latent component; similarly, the (i, j)-element of the loading matrix C represents the contribution of the i-th clinical measure to the j-th latent component. Note that, due to subject variability, the values of P_ij and C_ij vary depending on the training set used.
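The sketch below illustrates the relevance-based prescreening of Eq (20) and the structure of the nested LOOCV, assuming a generic fit_predict(X_tr, Y_tr, X_te, L) regression routine (e.g., PLS or KPLS) and candidate grids M_grid and L_grid. The names are illustrative, and the exhaustive triple loop is written for clarity rather than speed.

```python
# Sketch of relevance prescreening (Eq 20) and nested leave-one-out CV,
# assuming X (n x p functional connections), Y (n x q clinical measures) and
# a generic fit_predict(X_tr, Y_tr, X_te, L) regression routine.
import numpy as np
from sklearn.model_selection import LeaveOneOut

def relevance_ranking(X, Y):
    """Rank connections by R_r = sum_l rho_rl^2 (Eq 20), most relevant first."""
    Xs = (X - X.mean(0)) / X.std(0)
    Ys = (Y - Y.mean(0)) / Y.std(0)
    rho = Xs.T @ Ys / X.shape[0]                 # (p x q) Pearson correlations
    return np.argsort(-np.sum(rho ** 2, axis=1))

def nested_loocv(X, Y, M_grid, L_grid, fit_predict):
    outer_sq_err = []
    for tr, te in LeaveOneOut().split(X):        # outer loop: generalization
        best, best_rmse = None, np.inf
        for M in M_grid:                         # inner loops: choose (M, L)
            for L in L_grid:
                sq_err = []
                for itr, ite in LeaveOneOut().split(X[tr]):
                    sel = relevance_ranking(X[tr][itr], Y[tr][itr])[:M]
                    pred = fit_predict(X[tr][itr][:, sel], Y[tr][itr],
                                       X[tr][ite][:, sel], L)
                    sq_err.append(np.mean((pred - Y[tr][ite]) ** 2))
                rmse = np.sqrt(np.mean(sq_err))
                if rmse < best_rmse:
                    best, best_rmse = (M, L), rmse
        M_opt, L_opt = best
        sel = relevance_ranking(X[tr], Y[tr])[:M_opt]
        pred = fit_predict(X[tr][:, sel], Y[tr], X[te][:, sel], L_opt)
        outer_sq_err.append(np.mean((pred - Y[te]) ** 2))
    return np.sqrt(np.mean(outer_sq_err))        # outer-loop RMSE
```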