A Simple but Highly Effective Approach to Evaluate the Prognostic Performance of Gene Expression Signatures

Maud H. W. Starmans; Glenn Fung; Harald Steck; Bradly G. Wouters; Philippe Lambin

doi:10.1371/journal.pone.0028320

Corresponding author: Maud H. W. Starmans

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: Highly parallel analysis of gene expression has recently been used to identify gene sets or 'signatures' to improve patient diagnosis and risk stratification. Once a signature is generated, traditional statistical testing is used to evaluate its prognostic performance. However, due to the dimensionality of microarrays, this can lead to false interpretation of these signatures.

Principal Findings: A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. We found that a signature consisting of randomly selected genes has, on average, a 10% chance of reaching significance when assessed in a single dataset, but this chance can range from 1% to approximately 40% depending on the dataset in question. Increasing the number of validation datasets markedly reduces this number.

Conclusions: We have shown that an arbitrary cut-off value for evaluating signature significance is not suitable for this type of research; instead, the cut-off should be defined for each dataset separately. Our method can be used to evaluate the performance of any derived gene signature in a dataset by comparing it to the performance of thousands of randomly generated signatures. It will be of most interest for cases where few data are available and testing in multiple datasets is limited.
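The comparison against randomly generated signatures described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (mean expression of the signature genes), the function names, the two-sided empirical p-value, and the synthetic data are all assumptions made for illustration only.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def signature_score(expression, genes):
    """Assumed scoring rule: mean expression of the signature genes per patient."""
    return [sum(row[g] for g in genes) / len(genes) for row in expression]


def random_signature_aucs(expression, labels, size, n_random, rng):
    """AUCs of n_random signatures, each made of `size` randomly drawn genes."""
    n_genes = len(expression[0])
    return [
        auc(signature_score(expression, rng.sample(range(n_genes), size)), labels)
        for _ in range(n_random)
    ]


def empirical_p_value(observed_auc, null_aucs):
    """Fraction of random signatures at least as far from AUC = 0.5 as the
    observed signature (with the usual +1 correction)."""
    observed = abs(observed_auc - 0.5)
    hits = sum(abs(a - 0.5) >= observed for a in null_aucs)
    return (hits + 1) / (len(null_aucs) + 1)


# Synthetic stand-in for a patient dataset: rows are patients, columns are genes.
rng = random.Random(0)
n_patients, n_genes = 60, 200
expression = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in range(n_patients)]
labels = [i % 2 for i in range(n_patients)]  # hypothetical binary outcome

# Dataset-specific null distribution from 1000 random 10-gene signatures.
null_aucs = random_signature_aucs(expression, labels, size=10, n_random=1000, rng=rng)

# Hypothetical candidate signature: the first 10 genes.
candidate_auc = auc(signature_score(expression, list(range(10))), labels)
print(candidate_auc, empirical_p_value(candidate_auc, null_aucs))
```

Because the null distribution is rebuilt for each dataset, the significance threshold adapts to that dataset rather than relying on a fixed cut-off, which is the point the conclusions make.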

Original language	English
Article number	e28320
Journal	PLOS ONE
Volume	6
Issue number	12
DOIs	https://doi.org/10.1371/journal.pone.0028320
Publication status	Published - 7 Dec 2011

Access to Document

10.1371/journal.pone.0028320
Licence: CC BY

Cite this

@article{057cde8b21cb46eda15ba7ef547515bb,

title = "A Simple but Highly Effective Approach to Evaluate the Prognostic Performance of Gene Expression Signatures",

abstract = "Background: Highly parallel analysis of gene expression has recently been used to identify gene sets or 'signatures' to improve patient diagnosis and risk stratification. Once a signature is generated, traditional statistical testing is used to evaluate its prognostic performance. However, due to the dimensionality of microarrays, this can lead to false interpretation of these signatures. Principal Findings: A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. We found that a signature consisting of randomly selected genes has, on average, a 10% chance of reaching significance when assessed in a single dataset, but this chance can range from 1% to approximately 40% depending on the dataset in question. Increasing the number of validation datasets markedly reduces this number. Conclusions: We have shown that an arbitrary cut-off value for evaluating signature significance is not suitable for this type of research; instead, the cut-off should be defined for each dataset separately. Our method can be used to evaluate the performance of any derived gene signature in a dataset by comparing it to the performance of thousands of randomly generated signatures. It will be of most interest for cases where few data are available and testing in multiple datasets is limited.",

author = "Starmans, {Maud H. W.} and Glenn Fung and Harald Steck and Wouters, {Bradly G.} and Philippe Lambin",

year = "2011",

month = dec,

day = "7",

doi = "10.1371/journal.pone.0028320",

language = "English",

volume = "6",

journal = "PLOS ONE",

issn = "1932-6203",

publisher = "Public Library of Science",

number = "12",

}

TY - JOUR

T1 - A Simple but Highly Effective Approach to Evaluate the Prognostic Performance of Gene Expression Signatures

AU - Starmans, Maud H. W.

AU - Fung, Glenn

AU - Steck, Harald

AU - Wouters, Bradly G.

AU - Lambin, Philippe

PY - 2011/12/7

Y1 - 2011/12/7

N2 - Background: Highly parallel analysis of gene expression has recently been used to identify gene sets or 'signatures' to improve patient diagnosis and risk stratification. Once a signature is generated, traditional statistical testing is used to evaluate its prognostic performance. However, due to the dimensionality of microarrays, this can lead to false interpretation of these signatures. Principal Findings: A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. We found that a signature consisting of randomly selected genes has, on average, a 10% chance of reaching significance when assessed in a single dataset, but this chance can range from 1% to approximately 40% depending on the dataset in question. Increasing the number of validation datasets markedly reduces this number. Conclusions: We have shown that an arbitrary cut-off value for evaluating signature significance is not suitable for this type of research; instead, the cut-off should be defined for each dataset separately. Our method can be used to evaluate the performance of any derived gene signature in a dataset by comparing it to the performance of thousands of randomly generated signatures. It will be of most interest for cases where few data are available and testing in multiple datasets is limited.

AB - Background: Highly parallel analysis of gene expression has recently been used to identify gene sets or 'signatures' to improve patient diagnosis and risk stratification. Once a signature is generated, traditional statistical testing is used to evaluate its prognostic performance. However, due to the dimensionality of microarrays, this can lead to false interpretation of these signatures. Principal Findings: A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. We found that a signature consisting of randomly selected genes has, on average, a 10% chance of reaching significance when assessed in a single dataset, but this chance can range from 1% to approximately 40% depending on the dataset in question. Increasing the number of validation datasets markedly reduces this number. Conclusions: We have shown that an arbitrary cut-off value for evaluating signature significance is not suitable for this type of research; instead, the cut-off should be defined for each dataset separately. Our method can be used to evaluate the performance of any derived gene signature in a dataset by comparing it to the performance of thousands of randomly generated signatures. It will be of most interest for cases where few data are available and testing in multiple datasets is limited.

U2 - 10.1371/journal.pone.0028320

DO - 10.1371/journal.pone.0028320

M3 - Article

C2 - 22163293

SN - 1932-6203

VL - 6

JO - PLOS ONE

JF - PLOS ONE

IS - 12

M1 - e28320

ER -