Interpretation and visualization of non-linear data fusion in kernel space: Study on metabolomic characterization of progression of multiple sclerosis

A.M. Smolinska; L. Blanchet; L. Coulier; K.A. Ampt; T. Luider; R.Q. Hintzen; S.S. Wijmenga; L.M. Buydens

doi:10.1371/journal.pone.0038163

A.M. Smolinska*, L. Blanchet, L. Coulier, K.A. Ampt, T. Luider, R.Q. Hintzen, S.S. Wijmenga, L.M. Buydens

*Corresponding author for this work

Farmacologie en Toxicologie

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

BACKGROUND: In the last decade data fusion has become widespread in metabolomics. Linear data fusion is the most common approach. However, many datasets display non-linear parameter dependences, and linear methods are bound to fail in such situations. We used proton Nuclear Magnetic Resonance and Gas Chromatography-Mass Spectrometry, two well-established techniques, to generate metabolic profiles of cerebrospinal fluid from individuals with Multiple Sclerosis (MScl). These datasets represent non-linearly separable groups. Thus, extracting the relevant information and combining the datasets requires a dedicated data fusion framework.

METHODOLOGY: The main aim is to demonstrate a novel approach to data fusion for classification; the approach is applied to metabolomics datasets from patients at different stages of MScl. The approach involves data fusion in kernel space and consists of four main steps. In the first step, the significant information is extracted per data source using Support Vector Machine Recursive Feature Elimination, which selects a set of relevant variables. In the second step, the optimized kernel matrices are merged by linear combination. In the third step, the merged datasets are analyzed with a classification technique, namely Kernel Partial Least Squares Discriminant Analysis. In the final step, the variables in kernel space are visualized and their significance is established.

CONCLUSIONS: We find that fusion in kernel space allows for efficient and reliable discrimination of classes (MScl and early stage). This data fusion approach achieves better class prediction accuracy than analysis of the individual datasets and than the commonly used mid-level fusion. The prediction accuracy on an independent test set (8 samples) reaches 100%. Additionally, the classification model obtained on the fused kernels is simpler, i.e. just one latent variable was sufficient. Finally, visualization of variable importance in kernel space was achieved.
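
The kernel-merging step described in the abstract (merging kernel matrices by linear combination) can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of an RBF kernel, the kernel width `sigma`, the weight `mu`, the function names and the toy data are all assumptions made here for illustration. The abstract only states that optimized kernel matrices are merged by linear combination.

```python
import math


def rbf_kernel(X, sigma):
    """Return the RBF (Gaussian) kernel matrix for the rows of X.

    The RBF kernel is an assumed choice; the abstract does not name the kernel.
    """
    n = len(X)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(X[i], X[j]))
            K[i][j] = math.exp(-sq_dist / (2.0 * sigma ** 2))
    return K


def fuse_kernels(K1, K2, mu):
    """Weighted linear combination mu*K1 + (1 - mu)*K2 of two kernel matrices.

    Both matrices must describe the same samples in the same order.
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    if len(K1) != len(K2):
        raise ValueError("kernel matrices must have the same size")
    n = len(K1)
    return [[mu * K1[i][j] + (1.0 - mu) * K2[i][j] for j in range(n)]
            for i in range(n)]


# Hypothetical toy data: the same 3 samples measured on two platforms.
X_nmr = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]   # stand-in for NMR variables
X_gcms = [[2.0], [2.5], [4.0]]                 # stand-in for GC-MS variables

K_nmr = rbf_kernel(X_nmr, sigma=1.0)
K_gcms = rbf_kernel(X_gcms, sigma=1.0)
K_fused = fuse_kernels(K_nmr, K_gcms, mu=0.5)  # equal weights, assumed
```

Because each kernel matrix is samples-by-samples, the two platforms can contribute different numbers of variables and still be combined. The fused matrix could then be passed to a kernel-based classifier. In practice `sigma` and `mu` would be tuned, for example by cross-validation.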

Original language	English
Article number	38163
Number of pages	12
Journal	PLOS ONE
Volume	7
Issue number	6
DOIs	https://doi.org/10.1371/journal.pone.0038163
Publication status	Published - 8 Jun 2012

Keywords

SUPPORT VECTOR MACHINES
CEREBROSPINAL-FLUID
PATTERN-RECOGNITION
VALIDATION
SELECTION
MODELS
DISCRIMINATION
CLASSIFICATION
SPECTROMETRY
PROTEOMICS

Access to Document

10.1371/journal.pone.0038163 (Licence: CC BY)

Cite this

@article{41a88950eb7d4d9c87428fa83dc699ba,
title = "Interpretation and visualization of non-linear data fusion in kernel space: Study on metabolomic characterization of progression of multiple sclerosis",
abstract = "BACKGROUND: In the last decade data fusion has become widespread in the field of metabolomics. Linear data fusion is performed most commonly. However, many data display non-linear parameter dependences. The linear methods are bound to fail in such situations. We used proton Nuclear Magnetic Resonance and Gas Chromatography-Mass Spectrometry, two well established techniques, to generate metabolic profiles of Cerebrospinal fluid of Multiple Sclerosis (MScl) individuals. These datasets represent non-linearly separable groups. Thus, to extract relevant information and to combine them a special framework for data fusion is required. METHODOLOGY: The main aim is to demonstrate a novel approach for data fusion for classification; the approach is applied to metabolomics datasets coming from patients suffering from MScl at a different stage of the disease. The approach involves data fusion in kernel space and consists of four main steps. The first one is to extract the significant information per data source using Support Vector Machine Recursive Feature Elimination. This method allows one to select a set of relevant variables. In the next step the optimized kernel matrices are merged by linear combination. In step 3 the merged datasets are analyzed with a classification technique, namely Kernel Partial Least Square Discriminant Analysis. In the final step, the variables in kernel space are visualized and their significance established. CONCLUSIONS: We find that fusion in kernel space allows for efficient and reliable discrimination of classes (MScl and early stage). This data fusion approach achieves better class prediction accuracy than analysis of individual datasets and the commonly used mid-level fusion. The prediction accuracy on an independent test set (8 samples) reaches 100%. Additionally, the classification model obtained on fused kernels is simpler in terms of complexity, i.e. just one latent variable was sufficient. 
Finally, visualization of variables importance in kernel space was achieved.",

keywords = "SUPPORT VECTOR MACHINES, CEREBROSPINAL-FLUID, PATTERN-RECOGNITION, VALIDATION, SELECTION, MODELS, DISCRIMINATION, CLASSIFICATION, SPECTROMETRY, PROTEOMICS",
author = "A.M. Smolinska and L. Blanchet and L. Coulier and K.A. Ampt and T. Luider and R.Q. Hintzen and S.S. Wijmenga and L.M. Buydens",
year = "2012",
month = jun,
day = "8",
doi = "10.1371/journal.pone.0038163",
language = "English",
volume = "7",
journal = "PLOS ONE",
issn = "1932-6203",
publisher = "Public Library of Science",
number = "6",
}

TY - JOUR
T1 - Interpretation and visualization of non-linear data fusion in kernel space: Study on metabolomic characterization of progression of multiple sclerosis
AU - Smolinska, A.M.
AU - Blanchet, L.
AU - Coulier, L.
AU - Ampt, K.A.
AU - Luider, T.
AU - Hintzen, R.Q.
AU - Wijmenga, S.S.
AU - Buydens, L.M.
PY - 2012/6/8
Y1 - 2012/6/8
AB - BACKGROUND: In the last decade data fusion has become widespread in the field of metabolomics. Linear data fusion is performed most commonly. However, many data display non-linear parameter dependences. The linear methods are bound to fail in such situations. We used proton Nuclear Magnetic Resonance and Gas Chromatography-Mass Spectrometry, two well established techniques, to generate metabolic profiles of Cerebrospinal fluid of Multiple Sclerosis (MScl) individuals. These datasets represent non-linearly separable groups. Thus, to extract relevant information and to combine them a special framework for data fusion is required. METHODOLOGY: The main aim is to demonstrate a novel approach for data fusion for classification; the approach is applied to metabolomics datasets coming from patients suffering from MScl at a different stage of the disease. The approach involves data fusion in kernel space and consists of four main steps. The first one is to extract the significant information per data source using Support Vector Machine Recursive Feature Elimination. This method allows one to select a set of relevant variables. In the next step the optimized kernel matrices are merged by linear combination. In step 3 the merged datasets are analyzed with a classification technique, namely Kernel Partial Least Square Discriminant Analysis. In the final step, the variables in kernel space are visualized and their significance established. CONCLUSIONS: We find that fusion in kernel space allows for efficient and reliable discrimination of classes (MScl and early stage). This data fusion approach achieves better class prediction accuracy than analysis of individual datasets and the commonly used mid-level fusion. The prediction accuracy on an independent test set (8 samples) reaches 100%. Additionally, the classification model obtained on fused kernels is simpler in terms of complexity, i.e. just one latent variable was sufficient. 
Finally, visualization of variables importance in kernel space was achieved.

KW - SUPPORT VECTOR MACHINES
KW - CEREBROSPINAL-FLUID
KW - PATTERN-RECOGNITION
KW - VALIDATION
KW - SELECTION
KW - MODELS
KW - DISCRIMINATION
KW - CLASSIFICATION
KW - SPECTROMETRY
KW - PROTEOMICS
U2 - 10.1371/journal.pone.0038163
DO - 10.1371/journal.pone.0038163
M3 - Article
C2 - 22715376
SN - 1932-6203
VL - 7
JO - PLOS ONE
JF - PLOS ONE
IS - 6
M1 - 38163
ER -