Predicting missing proteomics values using machine learning: Filling the gap using transcriptomics and other biological features

Juan Ochoteco Asensio; Marcha Verheijen; Florian Caiment

doi:10.1016/j.csbj.2022.04.017

Juan Ochoteco Asensio, Marcha Verheijen, Florian Caiment^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Proteins are often considered the main biological elements responsible for the functions and structures of a cell. However, proteomics, the global study of all expressed proteins, is usually performed by mass spectrometry, which is limited by stochastic sampling and can quantify only a limited number of proteins per sample. Transcriptomics, which allows an exhaustive analysis of all expressed transcripts, is often used as a surrogate. Yet transcript levels do not correlate strongly with the corresponding protein levels, notably because of several post-transcriptional regulatory mechanisms. In this publication, we hypothesize that missing protein values in proteomics can be predicted using machine learning regression methods trained on many features extracted from transcriptomics, including known translational regulatory elements such as microRNAs and circular RNAs. After evaluating different machine learning algorithms under two different splitting strategies, we report that random forest can predict protein values in new samples from transcriptomics data with good accuracy. The pre-processing and model-building scripts are available on GitHub: https://github.com/jochotecoa/ml_proteomics.
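
The abstract mentions two splitting strategies without naming them. The sketch below is only an illustration of what two such strategies could look like, not the authors' method: it assumes one strategy holds out random (sample, protein) rows and the other holds out whole samples, so that test samples are never seen during training. The function names, the row layout, and the toy values are all hypothetical; the actual scripts are in the GitHub repository linked above.

```python
import random


def random_row_split(rows, test_fraction=0.2, seed=0):
    """Assumed strategy A: shuffle all (sample, protein) rows, hold out a fraction."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def by_sample_split(rows, test_fraction=0.2, seed=0):
    """Assumed strategy B: hold out whole samples, so test samples are new to the model."""
    rng = random.Random(seed)
    samples = sorted({row["sample"] for row in rows})
    rng.shuffle(samples)
    n_test = max(1, round(len(samples) * test_fraction))
    held_out = set(samples[:n_test])
    train = [row for row in rows if row["sample"] not in held_out]
    test = [row for row in rows if row["sample"] in held_out]
    return train, test


# Toy, made-up table: one row per (sample, protein) pair with one transcript feature.
rows = [
    {"sample": f"S{s}", "protein": f"P{p}", "transcript_level": float(s * 10 + p)}
    for s in range(5)
    for p in range(4)
]

train_a, test_a = random_row_split(rows)
train_b, test_b = by_sample_split(rows)

# With strategy B, no sample appears on both sides of the split.
shared_samples = {r["sample"] for r in train_b} & {r["sample"] for r in test_b}
```

Under these assumptions, a regression model such as a random forest would be fitted on the training rows and scored on the test rows. Only the by-sample split measures how well predictions transfer to new samples.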

Original language	English
Pages (from-to)	2057-2069
Number of pages	13
Journal	Computational and Structural Biotechnology Journal
Volume	20
DOIs	https://doi.org/10.1016/j.csbj.2022.04.017
Publication status	Published - 2022

Keywords

Circular RNAs
Expression
MicroRNAs
Machine learning
Protein
Proteomics
Quantification
RNA-sequencing
Transcriptomics

Access to Document

10.1016/j.csbj.2022.04.017 (Licence: CC BY)

Cite this

@article{2ed93da567504f0fb4cff3f7633635f0,

title = "Predicting missing proteomics values using machine learning: Filling the gap using transcriptomics and other biological features",

abstract = "Proteins are often considered the main biological element in charge of the different functions and structures of a cell. However, proteomics, the global study of all expressed proteins, often performed by mass spectrometry, is limited by its stochastic sampling and can only quantify a limited amount of protein per sample. Transcriptomics, which allows an exhaustive analysis of all expressed transcripts, is often used as a surrogate. However, the transcript level does not present a high level of correlation with the corresponding protein level, notably due to the existence of several post-transcriptional regulatory mechanisms. In this publication, we hypothesize that the missing protein values in proteomics could be predicted using machine learning regression methods, trained with many features extracted from transcriptomics, including known translational regulatory elements such as microRNAs and circular RNAs. After considering different machine learning algorithms applied on two different splitting strategies, we report that random forest can predict proteins in new samples out of transcriptomics data with good accuracy. The proposed pre-processing and model building scripts can be accessed on GitHub: https://github.com/jochotecoa/ml_proteomics.",

keywords = "CIRCULAR RNAS, EXPRESSION, MICRORNAS, Machine Learning, PROTEIN, Proteomics, QUANTIFICATION, Rna-sequencing, Transcriptomics",

author = "{Ochoteco Asensio}, Juan and Verheijen, Marcha and Caiment, Florian",

note = "{\textcopyright} 2022 The Authors.",

year = "2022",

doi = "10.1016/j.csbj.2022.04.017",

language = "English",

volume = "20",

pages = "2057--2069",

journal = "Computational and Structural Biotechnology Journal",

issn = "2001-0370",

publisher = "Research Network of Computational and Structural Biotechnology",

}

TY - JOUR

T1 - Predicting missing proteomics values using machine learning

T2 - Filling the gap using transcriptomics and other biological features

AU - Ochoteco Asensio, Juan

AU - Verheijen, Marcha

AU - Caiment, Florian

PY - 2022

Y1 - 2022

N2 - Proteins are often considered the main biological element in charge of the different functions and structures of a cell. However, proteomics, the global study of all expressed proteins, often performed by mass spectrometry, is limited by its stochastic sampling and can only quantify a limited amount of protein per sample. Transcriptomics, which allows an exhaustive analysis of all expressed transcripts, is often used as a surrogate. However, the transcript level does not present a high level of correlation with the corresponding protein level, notably due to the existence of several post-transcriptional regulatory mechanisms. In this publication, we hypothesize that the missing protein values in proteomics could be predicted using machine learning regression methods, trained with many features extracted from transcriptomics, including known translational regulatory elements such as microRNAs and circular RNAs. After considering different machine learning algorithms applied on two different splitting strategies, we report that random forest can predict proteins in new samples out of transcriptomics data with good accuracy. The proposed pre-processing and model building scripts can be accessed on GitHub: https://github.com/jochotecoa/ml_proteomics.

AB - Proteins are often considered the main biological element in charge of the different functions and structures of a cell. However, proteomics, the global study of all expressed proteins, often performed by mass spectrometry, is limited by its stochastic sampling and can only quantify a limited amount of protein per sample. Transcriptomics, which allows an exhaustive analysis of all expressed transcripts, is often used as a surrogate. However, the transcript level does not present a high level of correlation with the corresponding protein level, notably due to the existence of several post-transcriptional regulatory mechanisms. In this publication, we hypothesize that the missing protein values in proteomics could be predicted using machine learning regression methods, trained with many features extracted from transcriptomics, including known translational regulatory elements such as microRNAs and circular RNAs. After considering different machine learning algorithms applied on two different splitting strategies, we report that random forest can predict proteins in new samples out of transcriptomics data with good accuracy. The proposed pre-processing and model building scripts can be accessed on GitHub: https://github.com/jochotecoa/ml_proteomics.

KW - CIRCULAR RNAS

KW - EXPRESSION

KW - MICRORNAS

KW - Machine Learning

KW - PROTEIN

KW - Proteomics

KW - QUANTIFICATION

KW - Rna-sequencing

KW - Transcriptomics

U2 - 10.1016/j.csbj.2022.04.017

DO - 10.1016/j.csbj.2022.04.017

M3 - Article

C2 - 35601960

SN - 2001-0370

VL - 20

SP - 2057

EP - 2069

JO - Computational and Structural Biotechnology Journal

JF - Computational and Structural Biotechnology Journal

ER -