De Novo and Supervised Endophenotyping Using Network-Guided Ensemble Learning

Simon J. Larsen*, Harald H.H.W. Schmidt, Jan Baumbach

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review



Introduction: Precision medicine requires the accurate identification of genes and pathways that mechanistically define a disease phenotype. Modern omics may deliver this but has so far yielded only a few translational successes. While gene signatures derived from single-omics analysis have proven useful for disease diagnosis and prognosis, they often do not explain the underlying mechanism.
Methods: Here we present Grand Forest, an ensemble learning method that extends random forests and integrates experimental data with molecular interaction networks to discover relevant endophenotypes and their defining gene modules. Our method covers two application scenarios: a supervised method for finding modules associated with outcome and an unsupervised method for finding de novo patient subgroups.
Results: We applied the supervised Grand Forest methodology to five disease-related transcriptome data sets and compared the results with four state-of-the-art methods. Grand Forest consistently found gene modules with greater biomedical relevance, reproducibility, and interaction density, but fewer differentially expressed genes. Using the unsupervised method to discover gene modules from unlabeled data, lung cancer patients could be de novo stratified into clinically relevant molecular subgroups. Further analysis revealed that known disease genes were only marginally over-represented among differentially expressed genes, and that our method was driven mainly by network topology.
Conclusion: With Grand Forest, we developed a novel approach to disease module discovery and demonstrated that it identifies biologically relevant gene modules and patient subgroups. We conclude that differential expression was not effective for identifying driving genes and that the results were likely confounded by bias in the network data. We caution readers to consider these issues when applying network-based methods to gene expression analysis. Grand Forest is available at
Original language: English
Pages (from-to): 8-21
Journal: Systems medicine (New Rochelle, N.Y.)
Issue number: 1
Publication status: Published - 31 Jan 2020
