Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)

Maryam Panahiazar; Michel Dumontier; Olivier Gevaert

doi:10.1016/j.jbi.2017.06.017

Maryam Panahiazar, Michel Dumontier, Olivier Gevaert*

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality (the number of unique values) of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
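The comparison against a majority vote classifier described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: instead of PART, Apriori, Predictive Apriori, or Decision Table, it uses a much simpler one-attribute rule learner (similar in spirit to OneR). The records, field names, and values below are invented toy data, not GEO data. Precision, recall, and F-measure are macro-averaged over classes, which is an assumption, since the abstract does not say how they were averaged.

```python
from collections import Counter, defaultdict

# Toy records invented for illustration only; they are NOT taken from GEO.
records = [
    {"molecule": "total RNA", "type": "RNA"},
    {"molecule": "total RNA", "type": "RNA"},
    {"molecule": "total RNA", "type": "RNA"},
    {"molecule": "genomic DNA", "type": "genomic"},
    {"molecule": "genomic DNA", "type": "genomic"},
    {"molecule": "total RNA", "type": "RNA"},
]
train, test = records[:4], records[4:]


def majority_value(rows, target):
    """Majority vote baseline: the most frequent target value."""
    return Counter(r[target] for r in rows).most_common(1)[0][0]


def learn_rules(rows, feature, target):
    """One rule per feature value: predict its most frequent target value."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[r[feature]][r[target]] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}


def evaluate(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F-measure."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true))
    precisions, recalls, f_measures = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        precisions.append(precision)
        recalls.append(recall)
        f_measures.append(2 * precision * recall / total if total else 0.0)
    n = len(classes)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f_measures) / n,
    }


default = majority_value(train, "type")
rules = learn_rules(train, "molecule", "type")
y_true = [r["type"] for r in test]

# Baseline always predicts the majority class; the rule learner looks up
# the feature value and falls back to the majority class if it is unseen.
baseline_scores = evaluate(y_true, [default] * len(test))
rule_scores = evaluate(y_true, [rules.get(r["molecule"], default) for r in test])

print("majority vote:", baseline_scores)
print("rule learner: ", rule_scores)
```

On this toy split the rule learner beats the baseline only because the invented data were chosen to make the contrast visible; the scores say nothing about the results reported in the article.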

Original language	English
Pages (from-to)	132-139
Journal	Journal of Biomedical Informatics
Volume	72
DOIs	https://doi.org/10.1016/j.jbi.2017.06.017
Publication status	Published - Aug 2017

Keywords

Data mining
Prediction
Metadata
GEO
CEDAR

Access to Document

10.1016/j.jbi.2017.06.017
Licence: CC BY-NC-ND

Cite this

@article{1e44878ba4234bc6b4a590019a42383c,
title = "Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)",
abstract = "A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality (the number of unique values) of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.",
keywords = "Data mining, Prediction, Metadata, GEO, CEDAR",
author = "Maryam Panahiazar and Michel Dumontier and Olivier Gevaert",
year = "2017",
month = aug,
doi = "10.1016/j.jbi.2017.06.017",
language = "English",
volume = "72",
pages = "132--139",
journal = "Journal of Biomedical Informatics",
issn = "1532-0464",
publisher = "Elsevier Science",
}

TY  - JOUR
T1  - Predicting biomedical metadata in CEDAR: A study of Gene Expression Omnibus (GEO)
AU  - Panahiazar, Maryam
AU  - Dumontier, Michel
AU  - Gevaert, Olivier
PY  - 2017/8
Y1  - 2017/8
N2  - A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality (the number of unique values) of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
AB  - A crucial and limiting factor in data reuse is the lack of accurate, structured, and complete descriptions of data, known as metadata. Towards improving the quantity and quality of metadata, we propose a novel metadata prediction framework to learn associations from existing metadata that can be used to predict metadata values. We evaluate our framework in the context of experimental metadata from the Gene Expression Omnibus (GEO). We applied four rule mining algorithms to the most common structured metadata elements (sample type, molecular type, platform, label type, and organism) from over 1.3 million GEO records. We examined the quality of well-supported rules from each algorithm and visualized the dependencies among metadata elements. Finally, we evaluated the performance of the algorithms in terms of accuracy, precision, recall, and F-measure. We found that PART is the best algorithm, outperforming Apriori, Predictive Apriori, and Decision Table. All algorithms perform significantly better in predicting class values than the majority vote classifier. We found that the performance of the algorithms is related to the dimensionality of the GEO elements. The average performance of all algorithms increases as the dimensionality (the number of unique values) of these elements decreases (2697 platforms, 537 organisms, 454 labels, 9 molecules, and 5 types). Our work suggests that experimental metadata such as that present in GEO can be accurately predicted using rule mining algorithms. Our work has implications for both prospective and retrospective augmentation of metadata quality, which are geared towards making data easier to find and reuse.
KW  - Data mining
KW  - Prediction
KW  - Metadata
KW  - GEO
KW  - CEDAR
U2  - 10.1016/j.jbi.2017.06.017
DO  - 10.1016/j.jbi.2017.06.017
M3  - Article
C2  - 28625880
SN  - 1532-0464
VL  - 72
SP  - 132
EP  - 139
JO  - Journal of Biomedical Informatics
JF  - Journal of Biomedical Informatics
ER  - 