Experience: Automated Prediction of Experimental Metadata from Scientific Publications

S. Nayak; A. Zaveri; P.H. Serrano; M. Dumontier

doi:10.1145/3451219

Corresponding author for this work: S. Nayak

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In particular, metadata are useful to understand the nature and provenance of the data. A common approach to improving the quality of metadata relies on expensive human curation, which is time-consuming and error-prone. Towards improving the quality of metadata, we use scientific publications to automatically predict metadata key:value pairs. For prediction, we use a Convolutional Neural Network (CNN) and a Bidirectional Long Short-Term Memory network (BiLSTM). We focus our attention on the NCBI Disease Corpus, which is used for training the CNN and BiLSTM. We perform two different kinds of experiments with these two architectures: (1) we predict the disease names by using their unique ID in the MeSH ontology and (2) we use the tree structure of the MeSH ontology to move up in the hierarchy of these disease terms, which reduces the number of labels. We also apply various multi-label classification techniques in the above-mentioned experiments. We find that in both cases the CNN achieves the best results in predicting the superclasses for diseases, with an accuracy of 83%.
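The label reduction in experiment (2) can be illustrated with a minimal Python sketch. It assumes that each disease label is available as a dot-separated MeSH tree number, so that truncating the number moves the label up the hierarchy. The function names `roll_up` and `roll_up_labels` and the sample tree numbers are hypothetical illustrations, not code or data from the paper.

```python
def roll_up(tree_number: str, depth: int = 1) -> str:
    """Truncate a dot-separated tree number to its first `depth` levels."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return ".".join(tree_number.split(".")[:depth])


def roll_up_labels(tree_numbers, depth=1):
    """Map one document's tree numbers to a sorted list of unique ancestors."""
    return sorted({roll_up(t, depth) for t in tree_numbers})


# Illustrative values only: written in the tree-number format, not looked up
# in MeSH and not taken from the NCBI Disease Corpus.
docs = [
    ["C04.557.337", "C04.588.180"],
    ["C10.228.140", "C04.557.337"],
]

# Label vocabulary before and after moving up to the top level.
fine_labels = {t for doc in docs for t in doc}
coarse_labels = {a for doc in docs for a in roll_up_labels(doc)}

print(len(fine_labels), "fine-grained labels")
print(len(coarse_labels), "superclass labels:", sorted(coarse_labels))
```

Each document can still carry several superclass labels after the roll-up, so the task remains multi-label; only the size of the label vocabulary shrinks.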

Original language	English
Article number	21
Number of pages	11
Journal	Journal of Data and Information Quality
Volume	13
Issue number	4
DOIs	https://doi.org/10.1145/3451219
Publication status	Published - 1 Dec 2021

Keywords

Datasets
neural networks
metadata
quality
natural language processing

Access to Document

10.1145/3451219

Full Text: Final published version, 602 KB. Licence: Taverne

Cite this

@article{f3bf7f53879f4dad967c87b040d4dd85,

title = "Experience: Automated Prediction of Experimental Metadata from Scientific Publications",

abstract = "While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In particular, metadata are useful to understand the nature and provenance of the data. A common approach to improving the quality of metadata relies on expensive human curation, which itself is time-consuming and also prone to error. Towards improving the quality of metadata, we use scientific publications to automatically predict metadata key:value pairs. For prediction, we use a Convolutional Neural Network (CNN) and a Bidirectional Long-short term memory network (BiLSTM). We focus our attention on the NCBI Disease Corpus, which is used for training the CNN and BiLSTM. We perform two different kinds of experiments with these two architectures: (1) we predict the disease names by using their unique ID in the MeSH ontology and (2) we use the tree structures of MeSH ontology to move up in the hierarchy of these disease terms, which reduces the number of labels. We also perform various multi-label classification techniques for the above-mentioned experiments. We find that in both cases CNN achieves the best results in predicting the superclasses for disease with an accuracy of 83%.",

keywords = "Datasets, neural networks, metadata, quality, natural language processing",

author = "S. Nayak and A. Zaveri and P.H. Serrano and M. Dumontier",

note = "Publisher Copyright: {\textcopyright} 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.",

year = "2021",

month = dec,

day = "1",

doi = "10.1145/3451219",

language = "English",

volume = "13",

journal = "Journal of Data and Information Quality",

issn = "1936-1955",

publisher = "Association for Computing Machinery (ACM)",

number = "4",

}

TY - JOUR

T1 - Experience: Automated Prediction of Experimental Metadata from Scientific Publications

AU - Nayak, S.

AU - Zaveri, A.

AU - Serrano, P.H.

AU - Dumontier, M.

PY - 2021/12/1

Y1 - 2021/12/1

N2 - While there exists an abundance of open biomedical data, the lack of high-quality metadata makes it challenging for others to find relevant datasets and to reuse them for another purpose. In particular, metadata are useful to understand the nature and provenance of the data. A common approach to improving the quality of metadata relies on expensive human curation, which itself is time-consuming and also prone to error. Towards improving the quality of metadata, we use scientific publications to automatically predict metadata key:value pairs. For prediction, we use a Convolutional Neural Network (CNN) and a Bidirectional Long-short term memory network (BiLSTM). We focus our attention on the NCBI Disease Corpus, which is used for training the CNN and BiLSTM. We perform two different kinds of experiments with these two architectures: (1) we predict the disease names by using their unique ID in the MeSH ontology and (2) we use the tree structures of MeSH ontology to move up in the hierarchy of these disease terms, which reduces the number of labels. We also perform various multi-label classification techniques for the above-mentioned experiments. We find that in both cases CNN achieves the best results in predicting the superclasses for disease with an accuracy of 83%.

KW - Datasets

KW - neural networks

KW - metadata

KW - quality

KW - natural language processing

U2 - 10.1145/3451219

DO - 10.1145/3451219

M3 - Article

SN - 1936-1955

VL - 13

JO - Journal of Data and Information Quality

JF - Journal of Data and Information Quality

IS - 4

M1 - 21

ER -