MASSISTANT: A deep learning model for De Novo molecular structure prediction from EI-MS spectra via SELFIES encoding

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Gas chromatography coupled with electron impact mass spectrometry (GC-EI-MS) is a widely used analytical technique for identifying volatile and semi-volatile compounds in applications ranging from pharmaceutical research to materials science. However, since not every molecule is included in EI-MS databases, scientists often have to identify unknown chromatographic peaks solely from their EI-MS spectra. This manual interpretation is time-consuming and depends heavily on expert knowledge, often leading to ambiguous or inconclusive results. In this work, we introduce MASSISTANT, a novel deep learning model that predicts de novo molecular structures directly from low-resolution EI-MS spectra using SELFIES encoding. MASSISTANT was trained on compounds with molecular weights below 600 Da, and its performance is sensitive to dataset curation: training on the full NIST dataset (180k spectra) yields approximately 10 % exact predictions, whereas a more focused, chemically homogeneous subset raises this rate to as high as 54 % (Tanimoto score = 1). These results highlight the capability of deep neural networks to capture complex fragmentation patterns and generate chemically valid structures, offering mass spectrometry scientists a powerful tool to enhance the interpretation and elucidation of not only whole molecular structures but also substructures and functional groups in GC-EI-MS analyses.
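The abstract counts a prediction as exact when its Tanimoto score against the reference structure equals 1. As a rough illustration of that criterion, the sketch below computes the Tanimoto (Jaccard) coefficient between two fingerprints represented as sets of "on" bit indices. The abstract does not state which fingerprint is used, so the bit sets, the function name `tanimoto`, and the empty-fingerprint convention are all assumptions made for this example and are not taken from the paper.

```python
def tanimoto(bits_a: set[int], bits_b: set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not bits_a and not bits_b:
        # Convention assumed here: two empty fingerprints count as identical.
        return 1.0
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Hypothetical fingerprints (indices of set bits), for illustration only.
reference = {1, 4, 7, 9}
exact_prediction = {1, 4, 7, 9}
partial_prediction = {1, 4, 8}

exact_score = tanimoto(reference, exact_prediction)      # identical bit sets
partial_score = tanimoto(reference, partial_prediction)  # 2 shared bits out of 5 distinct
```

Under this definition a score of 1 means the two fingerprints are identical, while lower scores indicate partial structural overlap, which is how predictions that recover only substructures or functional groups could still be scored.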
Original language: English
Article number: 466216
Journal: Journal of Chromatography A
Volume: 1759
Publication status: Published - 27 Sept 2025

Keywords

  • Cheminformatics
  • De novo structure prediction
  • Deep learning
  • Electron impact mass spectrometry
  • GC-MS
  • SELFIES
