My last entry introduces principal component analysis (PCA), one of many unsupervised learning tools. I concluded the post with a demonstration of principal component regression (PCR), which essentially is a ordinary least squares (OLS) fit using the first principal components (PCs) from the predictors. This brings about many advantages:
- There is virtually no limit for the number of predictors. PCA will perform the decomposition no matter how many variables you handle. Simpler models such as OLS do not cope with more predictors than observations (i.e. ).
- Correlated predictors do not undermine the regression fit. Collinearity is a problem for OLS, by widening the solution space, i.e. destabilizing coefficient estimation. PCA will always produce few uncorrelated PCs from a set of variables, correlated or not.
- The PCs carry the maximum amount of variance possible. In any predictive model, predictors with zero or near-zero variance often constitute a problem and behave as second intercepts. In the process of compression, PCA will find the projections where your data points spread out the most, thus facilitating learning from the subsequent OLS.
However, in many cases it is much wiser performing a decomposition similar to the PCA, yet constructing PCs that best explain the response, be it quantitative (regression) or qualitative (classification). This is the concept of partial least squares (PLS), whose PCs are more often designated latent variables (LVs), although in my understanding the two terms can be used interchangeably.
PLS safeguards advantages 1. and 2. and does not necessarily follow 3. In addition, it follows that if you use the same number of LVs and predictors, you are going to get exactly a OLS fit – each predictor gets its own LV, so it does not help much. PLS (regression) and PLS followed by discriminant analysis (PLS-DA, classification) are tremendously useful in predictive modelling. They are adequate in a wide variety of experimental designs and linear in their parameters, therefore more easily interpretable.
Today we will perform PLS-DA on the Arcene data set hosted at the UCI Machine Learning Repository that comprises 100 observations and 10,000 explanatory variables () in order to diagnose cancer from serum samples. From the 10,000 features, 7,000 comprise distinct mass spectrometry (MS) peaks, each determining the levels of a protein. For some reason the contributors added 3,000 random variables, probably to test robustness against noise (check more information in the link above).
Let’s get started with R
For predictive modelling I always use the caret package, which builds up on existing model packages and employs a single syntax. In addition, caret features countless functions that deal with pre-processing, imputation, summaries, plotting and much more we will see firsthand shortly.
For some reason there is an empty column sticking with the data set upon loading, so I have added the colClasses argument to get rid of it. Once you load the data set take a look and note that the levels of most proteins are right-skewed, which can be a problem. However, the authors of this study conducted an exhaustive pre-processing and looking into this would take far too long. The cancer / no-cancer labels (coded as -1 / 1) are stored in a different file, so we can directly append it to the full data set and later use the formula syntax for training the models.
We can finally search missing values (NA’s). If these were present it would require some form of imputation – that is not the case. Calling any(is.na(arcene)) should prompt a FALSE.
So now the big questions are:
- How accurately can we predict whether a patient is sick based of the MS profile of his/her blood serum?
- Which proteins / MS peaks best discriminate sick and healthy patients?
Time to get started with the function train. With respect to pre-processing we will remove zero-variance predictors and center and scale all those remaining, in this exact order using the preProc argument. Considering the size of the sample (n = 100), I will pick a 10x repeated 5-fold cross-validation (CV) – a large number of repetitions compensates for the high variance stemming from a reduced number of folds – rendering a total of 50 estimates of accuracy. Many experts advise in favor of simpler models when performance is undistinguishable, so we will apply the “one SE” rule that selects the least complex model with the average cross-validated accuracy within one standard error (SE) from that in the optimal model. Finally, we must define a range of values for the tuning parameter, the number of LVs. In this particular case we will consider all models with 1-20 LVs.
This figure depicts the CV profile, where we can learn the average accuracy (y-axis, %) obtained from models trained with different numbers of LVs (x-axis). Note the swift change in accuracy among models with 1-5 LVs. Although the model with six LVs had the highest average accuracy, calling mod1 in the console will show you that, because of the “one SE” rule, the selected model has five LVs (accuracy of 80.6%).
Now we will compare our PLS-DA to the classifier homolog of PCR – linear discriminant analysis (LDA) following a PCA reductive-step (PCA-DA, if such thing exists). Note that the PCA pre-processing is also set in the preProc argument. We can also try some more random forests (RF). I will not go into details about the workings and parameterisation of RF, as for today it will be purely illustrative. Note that RF will comparatively take very long (~15min in my 8Gb MacBook Pro from 2012). As usual, a simple object call will show you the summary of the corresponding CV.
Finally, we can compare PLS-DA, PCA-DA and RF with respect to accuracy. We will compile the three models using caret::resamples, borrowing the plotting capabilities of ggplot2 to compare the 50 accuracy estimates from the optimal cross-validated model in the three cases.
It is clear that the long RF run did not translate into a excelling performance, quite the opposite. Although in average all three models have similar performances, the RF displays a much larger variance in accuracy, which is of course a concern if we seek a robust model. In this case PLS-DA and PCA-DA exhibit the best performance (63-95% accuracy) and either model would do well in diagnosing cancer in new serum samples.
To conclude, we will determine the ten proteins that best diagnose cancer using the variable importance in the projection (ViP), from both the PLS-DA and PCA-DA. With varImp, this is given in relative levels (scaled to the range 0-100).
Very different figures, right? Idiosyncrasies aside, how many PCs did the PCA-DA use? By calling mod2$preProcess, we learn that “PCA needed 82 components to capture 95 percent of the variance”. This is because the PCA pre-processing functionality in caret uses, by default, as many PCs as it takes to cover 95% of the data variance. So, in one hand we have a PLS-DA with five LVs, on the other hand a PCA-DA with 82 PCs. This not only makes PCA-DA cheaply complicated, but arcane at the interpretation level.
The PLS-DA ViP plot above clearly distinguishes V1184 from all other proteins. This could be an interesting cancer biomarker. Of course, many other tests and models must be conducted in order to provide a reliable diagnostic tool. My aim here is only to introduce PLS and PLS-DA.
And that’s it for today, I hope you enjoyed. Please drop me a comment or message if you have any suggestion for the next post. Cheers!