Abstract
The selection of the optimal number of components remains a difficult but essential task in partial least squares (PLS). Randomization tests have the advantage of being automatic, and they make use of the entire dataset, in contrast to the widely used cross-validation approaches. A PLS model may include one or more components that carry a large amount of irrelevant data variation, and this might affect the model, depending on the assigned y-loading (which is the regression coefficient in the latent domain). We recently showed this within the basic sequence framework, with respect to the underlying theory of the PLS algorithm, and presented it to the chemometrics community. In this work, we show that this irrelevant data variation is the root cause of the difficulty that current methods have in selecting the optimal number of components. For randomization tests, PLS models with nonsignificant components may result in false positive tests because of the incorrect assumption that "the components enter the model in a natural order". We introduce a new randomization test, the weight randomization test, for the selection of the optimal number of components in PLS in light of the underlying theory of the PLS algorithm. In the proposed method, the null distribution is well characterized and efficiently determined, taking into account a newly defined model quality metric: the number of consecutive nonsignificant components (CNC). We illustrate the effectiveness of the weight randomization test in the optimization of preprocessing as well as in classification models, where, for the latter, results are compared with the double cross-validation procedure. This is an important step towards the full automation of PLS model development and routine updates.
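The abstract does not spell out the weight randomization procedure, so the following is only a minimal sketch of how a consecutive-nonsignificant-component (CNC) tolerance could enter component selection, not the authors' method. It assumes that a p-value per component is already available from some randomization test; the function name `select_n_components` and the parameters `alpha` and `max_cnc` are hypothetical. The point it illustrates is the one made above: a nonsignificant component may precede a significant one, so stopping at the first nonsignificant component can underestimate the model size.

```python
from typing import Sequence


def select_n_components(
    p_values: Sequence[float],
    alpha: float = 0.05,
    max_cnc: int = 2,
) -> int:
    """Pick a number of PLS components from per-component p-values.

    Hypothetical rule (an assumption, not taken from the paper): keep
    scanning components in order, and stop only once more than `max_cnc`
    consecutive components are nonsignificant at level `alpha`. The
    result is the 1-based index of the last significant component seen,
    or 0 if none was significant before stopping.
    """
    selected = 0  # last significant component found so far
    run = 0       # current run of consecutive nonsignificant components
    for index, p in enumerate(p_values, start=1):
        if p < alpha:
            selected = index
            run = 0
        else:
            run += 1
            if run > max_cnc:
                break
    return selected


# Made-up p-values: component 2 is nonsignificant, component 3 is not.
example_p = [0.001, 0.20, 0.01, 0.30, 0.40, 0.50, 0.002]
n_strict = select_n_components(example_p, max_cnc=0)    # stop at first failure
n_tolerant = select_n_components(example_p, max_cnc=2)  # tolerate short runs
```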
Original language | English |
---|---|
Article number | e2887 |
Number of pages | 15 |
Journal | Journal of Chemometrics |
Volume | 31 |
Issue number | 5 |
Publication status | Published - May 2017 |
Keywords
- number of components
- partial least squares
- randomization test
- ION MOBILITY SPECTROMETRY
- PARTIAL LEAST-SQUARES
- MULTIVARIATE CALIBRATION
- VARIABLE IMPORTANCE
- REGRESSION-MODELS
- CROSS-VALIDATION
- CHEMOMETRICS
- DISTRIBUTIONS
- OPTIMIZATION
- SPECTROSCOPY