Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation

G. Stavropoulos; R. van Vorstenbosch; D.M.A.E. Jonkers; J. Penders; J.E. Hill; F.J. van Schooten; A. Smolinska

doi:10.1016/j.aca.2021.339001

Corresponding author: A. Smolinska

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Data fusion has gained much attention in the field of life sciences, and this is because analysis of biological samples may require the use of data coming from multiple complementary sources to express the samples fully. Data fusion lies in the idea that different data platforms detect different biological entities. Therefore, if these different biological compounds are then combined, they can provide comprehensive profiling and understanding of the research question in hand. Data fusion can be performed in three different traditional ways: low-level, mid-level, and high-level data fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. In that regard, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, as well as individual platform classification predictions, were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples to then, subsequently, project the test samples onto that space. The implementation of pseudo-samples allowed for the identification of the most important variables per platform, finding relations among variables of the different data platforms, and the examination of how variables behave in the samples. The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. Concurrently, it tackles significant bottlenecks of the traditional ways of fusion and of another advanced fusion way discussed in the paper, and finally, it contradicts the general belief that the more data, the merrier the result, and therefore, considerations have to be taken into account before any data fusion analysis is conducted. (c) 2021 Published by Elsevier B.V.
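As a rough illustration of the quantities named in the abstract, the sketch below computes a random forest proximity matrix (the fraction of trees in which two samples end in the same terminal node), joins the per-platform matrices side by side, and scores predictions by sensitivity and specificity. It is not the authors' implementation: the abstract does not say how the proximities are stacked, so the side-by-side concatenation in `stack_proximities` is an assumption, and every function name and all toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def proximity_matrix(leaves: Sequence[Sequence[int]]) -> List[List[float]]:
    """Proximity of samples i and j = share of trees where both reach the same leaf.

    leaves[i][t] is the terminal-node index of sample i in tree t
    (toy values here; a fitted forest would supply them).
    """
    n_samples, n_trees = len(leaves), len(leaves[0])
    prox = [[0.0] * n_samples for _ in range(n_samples)]
    for i in range(n_samples):
        for j in range(i, n_samples):
            shared = sum(1 for t in range(n_trees) if leaves[i][t] == leaves[j][t])
            prox[i][j] = prox[j][i] = shared / n_trees
    return prox


def stack_proximities(blocks: Sequence[Sequence[Sequence[float]]]) -> List[List[float]]:
    """ASSUMED stacking step: place each platform's proximity rows side by side."""
    n_samples = len(blocks[0])
    return [[v for block in blocks for v in block[i]] for i in range(n_samples)]


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).

    Assumes both classes occur in y_true, otherwise a denominator is zero.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    return tp / (tp + fn), tn / (tn + fp)


# Toy leaf assignments for two hypothetical platforms (3 samples each).
leaves_a = [[0, 1, 2], [0, 1, 3], [4, 5, 3]]  # 3 trees
leaves_b = [[1, 1], [2, 1], [2, 2]]           # 2 trees

prox_a = proximity_matrix(leaves_a)
prox_b = proximity_matrix(leaves_b)
stacked = stack_proximities([prox_a, prox_b])  # 3 rows, 6 columns

# Toy labels and predictions for an external test set.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```

In practice the leaf indices would come from a trained forest, and the stacked matrix would feed a second-stage classifier; both steps are left out here because the abstract gives no detail on them.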

Original language	English
Article number	339001
Number of pages	11
Journal	Analytica Chimica Acta
Volume	1183
DOIs	https://doi.org/10.1016/j.aca.2021.339001
Publication status	Published - 23 Oct 2021

Keywords

Data fusion
Proximities
Stacking
Variable behaviour
Crohn's disease
Classification
PARTIAL LEAST-SQUARES
CLASSIFICATION
SELECTION
NMR

Access to Document

Full text: Final published version, 1.33 MB. Licence: Taverne

Cite this

@article{e1aa7ee9bcfa4179bf8963416eea71f2,

title = "Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation",

abstract = "Data fusion has gained much attention in the field of life sciences, and this is because analysis of biological samples may require the use of data coming from multiple complementary sources to express the samples fully. Data fusion lies in the idea that different data platforms detect different biological entities. Therefore, if these different biological compounds are then combined, they can provide comprehensive profiling and understanding of the research question in hand. Data fusion can be performed in three different traditional ways: low-level, mid-level, and high-level data fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. In that regard, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, as well as individual platform classification predictions, were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples to then, subsequently, project the test samples onto that space. 
The implementation of pseudo-samples allowed for the identification of the most important variables per platform, finding relations among variables of the different data platforms, and the examination of how variables behave in the samples. The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. Concurrently, it tackles significant bottlenecks of the traditional ways of fusion and of another advanced fusion way discussed in the paper, and finally, it contradicts the general belief that the more data, the merrier the result, and therefore, considerations have to be taken into account before any data fusion analysis is conducted. (c) 2021 Published by Elsevier B.V.",

keywords = "Data fusion, Proximities, Stacking, Variable behaviour, Crohn's disease, Classification, PARTIAL LEAST-SQUARES, CLASSIFICATION, SELECTION, NMR",

author = "G. Stavropoulos and {van Vorstenbosch}, R. and D.M.A.E. Jonkers and J. Penders and J.E. Hill and {van Schooten}, F.J. and A. Smolinska",

year = "2021",

month = oct,

day = "23",

doi = "10.1016/j.aca.2021.339001",

language = "English",

volume = "1183",

journal = "Analytica Chimica Acta",

issn = "0003-2670",

publisher = "Elsevier Science",

}

TY - JOUR

T1 - Advanced data fusion: Random forest proximities and pseudo-sample principle towards increased prediction accuracy and variable interpretation

AU - Stavropoulos, G.

AU - van Vorstenbosch, R.

AU - Jonkers, D.M.A.E.

AU - Penders, J.

AU - Hill, J.E.

AU - van Schooten, F.J.

AU - Smolinska, A.

PY - 2021/10/23

Y1 - 2021/10/23

N2 - Data fusion has gained much attention in the field of life sciences, and this is because analysis of biological samples may require the use of data coming from multiple complementary sources to express the samples fully. Data fusion lies in the idea that different data platforms detect different biological entities. Therefore, if these different biological compounds are then combined, they can provide comprehensive profiling and understanding of the research question in hand. Data fusion can be performed in three different traditional ways: low-level, mid-level, and high-level data fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. In that regard, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, as well as individual platform classification predictions, were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples to then, subsequently, project the test samples onto that space. The implementation of pseudo-samples allowed for the identification of the most important variables per platform, finding relations among variables of the different data platforms, and the examination of how variables behave in the samples. The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. Concurrently, it tackles significant bottlenecks of the traditional ways of fusion and of another advanced fusion way discussed in the paper, and finally, it contradicts the general belief that the more data, the merrier the result, and therefore, considerations have to be taken into account before any data fusion analysis is conducted. (c) 2021 Published by Elsevier B.V.

AB - Data fusion has gained much attention in the field of life sciences, and this is because analysis of biological samples may require the use of data coming from multiple complementary sources to express the samples fully. Data fusion lies in the idea that different data platforms detect different biological entities. Therefore, if these different biological compounds are then combined, they can provide comprehensive profiling and understanding of the research question in hand. Data fusion can be performed in three different traditional ways: low-level, mid-level, and high-level data fusion. However, the increasing complexity and amount of generated data require the development of more sophisticated fusion approaches. In that regard, the current study presents an advanced data fusion approach (i.e. proximities stacking) based on random forest proximities coupled with the pseudo-sample principle. Four different data platforms of 130 samples each (faecal microbiome, blood, blood headspace, and exhaled breath samples of patients who have Crohn's disease) were used to demonstrate the classification performance of this new approach. More specifically, 104 samples were used to train and validate the models, whereas the remaining 26 samples were used to validate the models externally. Mid-level, high-level, as well as individual platform classification predictions, were made and compared against the proximities stacking approach. The performance of each approach was assessed by calculating the sensitivity and specificity of each model for the external test set, and visualized by performing principal component analysis on the proximity matrices of the training samples to then, subsequently, project the test samples onto that space. The implementation of pseudo-samples allowed for the identification of the most important variables per platform, finding relations among variables of the different data platforms, and the examination of how variables behave in the samples. The proximities stacking approach outperforms both mid-level and high-level fusion approaches, as well as all individual platform predictions. Concurrently, it tackles significant bottlenecks of the traditional ways of fusion and of another advanced fusion way discussed in the paper, and finally, it contradicts the general belief that the more data, the merrier the result, and therefore, considerations have to be taken into account before any data fusion analysis is conducted. (c) 2021 Published by Elsevier B.V.

KW - Data fusion

KW - Proximities

KW - Stacking

KW - Variable behaviour

KW - Crohn's disease

KW - Classification

KW - PARTIAL LEAST-SQUARES

KW - CLASSIFICATION

KW - SELECTION

KW - NMR

U2 - 10.1016/j.aca.2021.339001

DO - 10.1016/j.aca.2021.339001

M3 - Article

C2 - 34627524

SN - 0003-2670

VL - 1183

JO - Analytica Chimica Acta

JF - Analytica Chimica Acta

M1 - 339001

ER -