Constructing bi-plots for random forest: Tutorial

Lionel Blanchet; Raffaele Vitale; Robert van Vorstenbosch; George Stavropoulos; John Pender; Daisy Jonkers; Frederik-Jan van Schooten; Agnieszka Smolinska

doi:10.1016/j.aca.2020.06.043

Corresponding author: Agnieszka Smolinska

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Current technological developments have allowed for a significant increase and availability of data. Consequently, this has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms in a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field, and they can be defined as approaches in which multiple, complex, independent/uncorrelated, predictive models are subsequently combined by either averaging or voting to yield a higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little necessity of model optimization. RF provides both a predictive model and an estimation of the variable importance. However, the estimation of the variable importance is based on thousands of trees, and therefore, it does not specify which variable is important for which sample group.

The present study demonstrates an approach based on the pseudo-sample principle that allows for construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated by using two simulated datasets, and three different types of real data, which include political sciences, food chemistry and the human microbiome data. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, and the variable importance and the relation among them. (c) 2020 Elsevier B.V. All rights reserved.

Original language	English
Pages (from-to)	146-155
Number of pages	10
Journal	Analytica Chimica Acta
Volume	1131
DOIs	https://doi.org/10.1016/j.aca.2020.06.043
Publication status	Published - 22 Sept 2020

Keywords

Random forest interpretation
Pseudo samples
Bi-plots
Proximity matrix
Principal coordinates analysis
PSEUDO-SAMPLE TRAJECTORIES
PARTIAL LEAST-SQUARES
FAULT-DIAGNOSIS
KERNEL

Cite this

@article{0ffd489e17f94f2fb781d6eeebe3707e,

title = "Constructing bi-plots for random forest: Tutorial",

abstract = "Current technological developments have allowed for a significant increase and availability of data. Consequently, this has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms in a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field, and they can be defined as approaches in which multiple, complex, independent/uncorrelated, predictive models are subsequently combined by either averaging or voting to yield a higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little necessity of model optimization. RF provides both a predictive model and an estimation of the variable importance. However, the estimation of the variable importance is based on thousands of trees, and therefore, it does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows for construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated by using two simulated datasets, and three different types of real data, which include political sciences, food chemistry and the human microbiome data. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, and the variable importance and the relation among them. (c) 2020 Elsevier B.V. All rights reserved.",

keywords = "Random forest interpretation, Pseudo samples, Bi-plots, Proximity matrix, Principal coordinates analysis, PSEUDO-SAMPLE TRAJECTORIES, PARTIAL LEAST-SQUARES, FAULT-DIAGNOSIS, KERNEL",

author = "Lionel Blanchet and Raffaele Vitale and {van Vorstenbosch}, Robert and George Stavropoulos and John Pender and Daisy Jonkers and {van Schooten}, Frederik-Jan and Agnieszka Smolinska",

note = "Funding Information: A/Prof John Penders is an expert in molecular epidemiology and microbial ecology. His research group integrates metagenomic methods within the context of prospective epidemiological studies using various longitudinal statistical and bioinformatics tools to elucidate the role of the microbiome in health and disease. His group (at Maastricht University, The Netherlands) is currently funded by The Netherlands Organization for Scientific Research, The Netherlands Organization for Health Research and Development and the Joint Programming Initiative on Healthy Diet for Healthy Living. He has authored more than 120 publications, including in leading journals like Nature Biotechnology, Lancet Infectious Diseases, Gastroenterology, Gut, Mucosal Immunology and Microbiome. Funding Information: This work was supported by Netherlands Organisation for Scientific Research (NWO, the Netherlands) (grant number: 016.Veni.178.064 ). Publisher Copyright: {\textcopyright} 2020 Elsevier B.V.",

year = "2020",

month = sep,

day = "22",

doi = "10.1016/j.aca.2020.06.043",

language = "English",

volume = "1131",

pages = "146--155",

journal = "Analytica Chimica Acta",

issn = "0003-2670",

publisher = "Elsevier Science",

}

TY - JOUR

T1 - Constructing bi-plots for random forest

T2 - Tutorial

AU - Blanchet, Lionel

AU - Vitale, Raffaele

AU - van Vorstenbosch, Robert

AU - Stavropoulos, George

AU - Pender, John

AU - Jonkers, Daisy

AU - van Schooten, Frederik-Jan

AU - Smolinska, Agnieszka

N1 - Funding Information: A/Prof John Penders is an expert in molecular epidemiology and microbial ecology. His research group integrates metagenomic methods within the context of prospective epidemiological studies using various longitudinal statistical and bioinformatics tools to elucidate the role of the microbiome in health and disease. His group (at Maastricht University, The Netherlands) is currently funded by The Netherlands Organization for Scientific Research, The Netherlands Organization for Health Research and Development and the Joint Programming Initiative on Healthy Diet for Healthy Living. He has authored more than 120 publications, including in leading journals like Nature Biotechnology, Lancet Infectious Diseases, Gastroenterology, Gut, Mucosal Immunology and Microbiome. Funding Information: This work was supported by Netherlands Organisation for Scientific Research (NWO, the Netherlands) (grant number: 016.Veni.178.064 ). Publisher Copyright: © 2020 Elsevier B.V.

PY - 2020/9/22

Y1 - 2020/9/22

N2 - Current technological developments have allowed for a significant increase and availability of data. Consequently, this has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms in a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field, and they can be defined as approaches in which multiple, complex, independent/uncorrelated, predictive models are subsequently combined by either averaging or voting to yield a higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little necessity of model optimization. RF provides both a predictive model and an estimation of the variable importance. However, the estimation of the variable importance is based on thousands of trees, and therefore, it does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows for construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated by using two simulated datasets, and three different types of real data, which include political sciences, food chemistry and the human microbiome data. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, and the variable importance and the relation among them. (c) 2020 Elsevier B.V. All rights reserved.

AB - Current technological developments have allowed for a significant increase and availability of data. Consequently, this has opened enormous opportunities for the machine learning and data science field, translating into the development of new algorithms in a wide range of applications in medical, biomedical, daily-life, and national security areas. Ensemble techniques are among the pillars of the machine learning field, and they can be defined as approaches in which multiple, complex, independent/uncorrelated, predictive models are subsequently combined by either averaging or voting to yield a higher model performance. Random forest (RF), a popular ensemble method, has been successfully applied in various domains due to its ability to build predictive models with high certainty and little necessity of model optimization. RF provides both a predictive model and an estimation of the variable importance. However, the estimation of the variable importance is based on thousands of trees, and therefore, it does not specify which variable is important for which sample group. The present study demonstrates an approach based on the pseudo-sample principle that allows for construction of bi-plots (i.e. spin plots) associated with RF models. The pseudo-sample principle for RF is explained and demonstrated by using two simulated datasets, and three different types of real data, which include political sciences, food chemistry and the human microbiome data. The pseudo-sample bi-plots, associated with RF and its unsupervised version, allow for a versatile visualization of multivariate models, and the variable importance and the relation among them. (c) 2020 Elsevier B.V. All rights reserved.

KW - Random forest interpretation

KW - Pseudo samples

KW - Bi-plots

KW - Proximity matrix

KW - Principal coordinates analysis

KW - PSEUDO-SAMPLE TRAJECTORIES

KW - PARTIAL LEAST-SQUARES

KW - FAULT-DIAGNOSIS

KW - KERNEL

U2 - 10.1016/j.aca.2020.06.043

DO - 10.1016/j.aca.2020.06.043

M3 - Article

C2 - 32928475

SN - 0003-2670

VL - 1131

SP - 146

EP - 155

JO - Analytica Chimica Acta

JF - Analytica Chimica Acta

ER -