Abstract
Introduction: Precision medicine requires the accurate identification of genes and pathways that mechanistically define a disease phenotype. Modern omics may deliver this but has so far yielded few translational successes. While gene signatures derived from single-omics analyses have proven useful for disease diagnosis and prognosis, they often do not explain the underlying mechanism.
Methods: Here we present Grand Forest, an ensemble learning method that extends random forests and integrates experimental data with molecular interaction networks to discover relevant endophenotypes and their defining gene modules. Our method covers two application scenarios: a supervised method for finding modules associated with outcome and an unsupervised method for finding de novo patient subgroups.
Results: We applied the supervised Grand Forest methodology to five disease-related transcriptome data sets and compared the results with four state-of-the-art methods. Grand Forest consistently found gene modules with greater biomedical relevance, reproducibility, and interaction density, but fewer differentially expressed genes. Using the unsupervised method to discover gene modules from unlabeled data, we stratified lung cancer patients de novo into clinically relevant molecular subgroups. Further analysis revealed that known disease genes were only marginally over-represented among differentially expressed genes, and that our method was driven mainly by network topology.
Conclusion: With Grand Forest, we developed a novel approach to disease module discovery and demonstrated that it identifies biologically relevant gene modules and patient subgroups. We conclude that differential expression was not effective for identifying driving genes and that the results were likely confounded by bias in the network data. We caution readers to consider these issues when applying network-based methods to gene expression analysis. Grand Forest is available at https://grandforest.compbio.sdu.dk.
Original language | English |
---|---|
Pages (from-to) | 8-21 |
Journal | Systems medicine (New Rochelle, N.Y.) |
Volume | 3 |
Issue number | 1 |
DOIs | |
Publication status | Published - 31 Jan 2020 |