A Parallelization Strategy for the Time Efficient Analysis of Thousands of LC/MS Runs in High-Performance Computing Environment

Patrick van Zalm; Arthur Viodé; Kinga Smolen; Benoit Fatou; Arash Nemati Hayati; Christoph N Schlaffner; Ofer Levy; Judith Steen; Hanno Steen

doi:10.1021/acs.jproteome.2c00278

Patrick van Zalm, Arthur Viodé, Kinga Smolen, Benoit Fatou, Arash Nemati Hayati, Christoph N Schlaffner, Ofer Levy, Judith Steen, Hanno Steen*

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Combining robust proteomics instrumentation with high-throughput enabling liquid chromatography (LC) systems (e.g., timsTOF Pro and the Evosep One system, respectively) enabled mapping the proteomes of 1000s of samples. Fragpipe is one of the few computational protein identification and quantification frameworks that allows for the time-efficient analysis of such large data sets. However, it requires large amounts of computational power and data storage space that leave even state-of-the-art workstations underpowered when it comes to the analysis of proteomics data sets with 1000s of LC mass spectrometry runs. To address this issue, we developed and optimized a Fragpipe-based analysis strategy for a high-performance computing environment and analyzed 3348 plasma samples (6.4 TB) that were longitudinally collected from hospitalized COVID-19 patients under the auspices of the Immunophenotyping Assessment in a COVID-19 Cohort (IMPACC) study. Our parallelization strategy reduced the total runtime by ∼90% from 116 (theoretical) days to just 9 days in the high-performance computing environment. All code is open-source and can be deployed in any Simple Linux Utility for Resource Management (SLURM) high-performance computing environment, enabling the analysis of large-scale high-throughput proteomics studies.
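
The runtime figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the authors' code. The function names `runtime_reduction` and `speedup` are hypothetical, and the only inputs are the 116-day (theoretical) and 9-day runtimes stated above.

```python
def runtime_reduction(serial_days: float, parallel_days: float) -> float:
    """Fraction of the serial runtime saved by the parallel run."""
    return (serial_days - parallel_days) / serial_days


def speedup(serial_days: float, parallel_days: float) -> float:
    """How many times faster the parallel run is than the serial one."""
    return serial_days / parallel_days


if __name__ == "__main__":
    # Values taken from the abstract: 116 theoretical days vs. 9 days on HPC.
    serial, parallel = 116, 9
    print(f"reduction: {runtime_reduction(serial, parallel):.1%}")  # 92.2%
    print(f"speedup:   {speedup(serial, parallel):.1f}x")           # 12.9x
```

A reduction of about 92% is consistent with the ∼90% stated in the abstract and corresponds to a roughly 13-fold speedup.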

Original language	English
Pages (from-to)	2810-2814
Number of pages	5
Journal	Journal of Proteome Research
Volume	21
Issue number	11
Early online date	6 Oct 2022
DOIs	https://doi.org/10.1021/acs.jproteome.2c00278
Publication status	Published - 4 Nov 2022

Keywords

Fragpipe
HPC
SLURM
parallelization
proteomics
timsTOF

Cite this

@article{7cc36787cc1f447f87d7b2204bd20c96,

title = "A Parallelization Strategy for the Time Efficient Analysis of Thousands of LC/MS Runs in High-Performance Computing Environment",

abstract = "Combining robust proteomics instrumentation with high-throughput enabling liquid chromatography (LC) systems (e.g., timsTOF Pro and the Evosep One system, respectively) enabled mapping the proteomes of 1000s of samples. Fragpipe is one of the few computational protein identification and quantification frameworks that allows for the time-efficient analysis of such large data sets. However, it requires large amounts of computational power and data storage space that leave even state-of-the-art workstations underpowered when it comes to the analysis of proteomics data sets with 1000s of LC mass spectrometry runs. To address this issue, we developed and optimized a Fragpipe-based analysis strategy for a high-performance computing environment and analyzed 3348 plasma samples (6.4 TB) that were longitudinally collected from hospitalized COVID-19 patients under the auspice of the Immunophenotyping Assessment in a COVID-19 Cohort (IMPACC) study. Our parallelization strategy reduced the total runtime by ∼90% from 116 (theoretical) days to just 9 days in the high-performance computing environment. All code is open-source and can be deployed in any Simple Linux Utility for Resource Management (SLURM) high-performance computing environment, enabling the analysis of large-scale high-throughput proteomics studies.",

keywords = "Fragpipe, HPC, SLURM, parallelization, proteomics, timsTOF",

author = "{van Zalm}, Patrick and Arthur Viod{\'e} and Kinga Smolen and Benoit Fatou and Hayati, {Arash Nemati} and Schlaffner, {Christoph N} and Ofer Levy and Judith Steen and Hanno Steen",

note = "Funding Information: We acknowledge (i) Arthur D. Stuivenberg for his help on the principles of software containerization, (ii) all IMPACC study participants for enrolling in this study and providing valuable samples, (iii) all clinical teams for collecting and processing all the samples and caring for the study participants, and (iv) the IMPACC leadership for supervision, guidance, and leadership. This work was supported by the grant U19 AI118608-01A1 to O.L. as part of the IMPACC study funded by the United States National Institutes of Health through the following grants: 5R01AI135803-03, 5U19 AI128910-04, 4U19AI090023-11, 4U19AI118610-06, R01AI145835-01A1S1, 5U19AI062629-17, 5U19AI057229-17, 5U19AI125357-05, 5U19AI128913-03, 3U19AI077439-13, 5U54AI142766-03, 5R01AI104870-07, 3U19AI089992-09. The Precision Vaccines Program is supported in part by the Boston Children{\textquoteright}s Hospital Department of Pediatrics and philanthropy from the Boston Investment Council and the Barry family. Publisher Copyright: {\textcopyright} 2022 American Chemical Society. All rights reserved.",

year = "2022",

month = nov,

day = "4",

doi = "10.1021/acs.jproteome.2c00278",

language = "English",

volume = "21",

pages = "2810--2814",

journal = "Journal of Proteome Research",

issn = "1535-3893",

publisher = "American Chemical Society",

number = "11",

}

TY - JOUR

T1 - A Parallelization Strategy for the Time Efficient Analysis of Thousands of LC/MS Runs in High-Performance Computing Environment

AU - van Zalm, Patrick

AU - Viodé, Arthur

AU - Smolen, Kinga

AU - Fatou, Benoit

AU - Hayati, Arash Nemati

AU - Schlaffner, Christoph N

AU - Levy, Ofer

AU - Steen, Judith

AU - Steen, Hanno

N1 - Funding Information: We acknowledge (i) Arthur D. Stuivenberg for his help on the principles of software containerization, (ii) all IMPACC study participants for enrolling in this study and providing valuable samples, (iii) all clinical teams for collecting and processing all the samples and caring for the study participants, and (iv) the IMPACC leadership for supervision, guidance, and leadership. This work was supported by the grant U19 AI118608-01A1 to O.L. as part of the IMPACC study funded by the United States National Institutes of Health through the following grants: 5R01AI135803-03, 5U19 AI128910-04, 4U19AI090023-11, 4U19AI118610-06, R01AI145835-01A1S1, 5U19AI062629-17, 5U19AI057229-17, 5U19AI125357-05, 5U19AI128913-03, 3U19AI077439-13, 5U54AI142766-03, 5R01AI104870-07, 3U19AI089992-09. The Precision Vaccines Program is supported in part by the Boston Children’s Hospital Department of Pediatrics and philanthropy from the Boston Investment Council and the Barry family. Publisher Copyright: © 2022 American Chemical Society. All rights reserved.

PY - 2022/11/4

Y1 - 2022/11/4

N2 - Combining robust proteomics instrumentation with high-throughput enabling liquid chromatography (LC) systems (e.g., timsTOF Pro and the Evosep One system, respectively) enabled mapping the proteomes of 1000s of samples. Fragpipe is one of the few computational protein identification and quantification frameworks that allows for the time-efficient analysis of such large data sets. However, it requires large amounts of computational power and data storage space that leave even state-of-the-art workstations underpowered when it comes to the analysis of proteomics data sets with 1000s of LC mass spectrometry runs. To address this issue, we developed and optimized a Fragpipe-based analysis strategy for a high-performance computing environment and analyzed 3348 plasma samples (6.4 TB) that were longitudinally collected from hospitalized COVID-19 patients under the auspice of the Immunophenotyping Assessment in a COVID-19 Cohort (IMPACC) study. Our parallelization strategy reduced the total runtime by ∼90% from 116 (theoretical) days to just 9 days in the high-performance computing environment. All code is open-source and can be deployed in any Simple Linux Utility for Resource Management (SLURM) high-performance computing environment, enabling the analysis of large-scale high-throughput proteomics studies.

AB - Combining robust proteomics instrumentation with high-throughput enabling liquid chromatography (LC) systems (e.g., timsTOF Pro and the Evosep One system, respectively) enabled mapping the proteomes of 1000s of samples. Fragpipe is one of the few computational protein identification and quantification frameworks that allows for the time-efficient analysis of such large data sets. However, it requires large amounts of computational power and data storage space that leave even state-of-the-art workstations underpowered when it comes to the analysis of proteomics data sets with 1000s of LC mass spectrometry runs. To address this issue, we developed and optimized a Fragpipe-based analysis strategy for a high-performance computing environment and analyzed 3348 plasma samples (6.4 TB) that were longitudinally collected from hospitalized COVID-19 patients under the auspice of the Immunophenotyping Assessment in a COVID-19 Cohort (IMPACC) study. Our parallelization strategy reduced the total runtime by ∼90% from 116 (theoretical) days to just 9 days in the high-performance computing environment. All code is open-source and can be deployed in any Simple Linux Utility for Resource Management (SLURM) high-performance computing environment, enabling the analysis of large-scale high-throughput proteomics studies.

KW - Fragpipe

KW - HPC

KW - SLURM

KW - parallelization

KW - proteomics

KW - timsTOF

U2 - 10.1021/acs.jproteome.2c00278

DO - 10.1021/acs.jproteome.2c00278

M3 - Article

C2 - 36201825

SN - 1535-3893

VL - 21

SP - 2810

EP - 2814

JO - Journal of Proteome Research

JF - Journal of Proteome Research

IS - 11

ER -