Relation extraction from DailyMed structured product labels by optimally combining crowd, experts and machines

K. Shingjergji; R. Celebi; J. Scholtes; M. Dumontier

doi:10.1016/j.jbi.2021.103902

K. Shingjergji, R. Celebi*, J. Scholtes, M. Dumontier

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting, and incomplete context. For example, databases of FDA-approved drug indications used in computational drug repositioning studies do not distinguish between treatments that simply offer symptomatic relief from those that target the underlying pathology. Moreover, drug indication sources often lack proper provenance and have little overlap. Consequently, new predictions can be of poor quality as they offer little in the way of new insights. Hence, work remains to be done to establish higher quality databases of drug indications that are suitable for use in drug discovery and repositioning studies. Here, we report on the combination of weak supervision (i.e., programmatic labeling and crowdsourcing) and deep learning methods for relation extraction from DailyMed text to create a higher quality drug-disease relation dataset. The generated drug-disease relation data shows a high overlap with DrugCentral, a manually curated dataset. Using this dataset, we constructed a machine learning model to classify relations between drugs and diseases from text into four categories: treatment, symptomatic relief, contradiction, and effect, exhibiting an improvement of 15.5% with Bi-LSTM (F1 score of 71.8%) over the best performing discrete method. Access to high quality data is crucial to building accurate and reliable drug repurposing prediction models. Our work suggests how the combination of crowds, experts, and machine learning methods can go hand-in-hand to improve datasets and predictive models.

Original language	English
Article number	103902
Number of pages	11
Journal	Journal of Biomedical Informatics
Volume	122
DOIs	https://doi.org/10.1016/j.jbi.2021.103902
Publication status	Published - 1 Oct 2021

Keywords

Drug-disease relation classification
Drug indications
Drug data quality
Drug repositioning
Weak supervision
Programmatic labeling
Crowdsourcing
Human-in-the-loop
Machine learning

Access to Document

10.1016/j.jbi.2021.103902

Licence: CC BY

Cite this

@article{77de229802234576ac99a541cd217a37,

title = "Relation extraction from DailyMed structured product labels by optimally combining crowd, experts and machines",

abstract = "The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting, and incomplete context. For example, databases of FDA-approved drug indications used in computational drug repositioning studies do not distinguish between treatments that simply offer symptomatic relief from those that target the underlying pathology. Moreover, drug indication sources often lack proper provenance and have little overlap. Consequently, new predictions can be of poor quality as they offer little in the way of new insights. Hence, work remains to be done to establish higher quality databases of drug indications that are suitable for use in drug discovery and repositioning studies. Here, we report on the combination of weak supervision (i.e., programmatic labeling and crowdsourcing) and deep learning methods for relation extraction from DailyMed text to create a higher quality drug-disease relation dataset. The generated drug-disease relation data shows a high overlap with DrugCentral, a manually curated dataset. Using this dataset, we constructed a machine learning model to classify relations between drugs and diseases from text into four categories: treatment, symptomatic relief, contradiction, and effect, exhibiting an improvement of 15.5% with Bi-LSTM (F1 score of 71.8%) over the best performing discrete method. Access to high quality data is crucial to building accurate and reliable drug repurposing prediction models. Our work suggests how the combination of crowds, experts, and machine learning methods can go hand-in-hand to improve datasets and predictive models.",

keywords = "Drug-disease relation classification, Drug indications, Drug data quality, Drug repositioning, Weak supervision, Programmatic labeling, Crowdsourcing, Human-in-the-loop, Machine learning",

author = "K. Shingjergji and R. Celebi and J. Scholtes and M. Dumontier",

note = "Funding Information: We thank Dr. Ozgun Erten, Dr. Marwa Abdelhakim, Dr. Sherwin Kuo and Dr. Tiffany I. Leung, for providing us with the expert annotation of the data set. Publisher Copyright: {\textcopyright} 2021 The Author(s)",

year = "2021",

month = oct,

day = "1",

doi = "10.1016/j.jbi.2021.103902",

language = "English",

volume = "122",

journal = "Journal of Biomedical Informatics",

issn = "1532-0464",

publisher = "Elsevier Science",

}

TY - JOUR

T1 - Relation extraction from DailyMed structured product labels by optimally combining crowd, experts and machines

AU - Shingjergji, K.

AU - Celebi, R.

AU - Scholtes, J.

AU - Dumontier, M.

N1 - Funding Information: We thank Dr. Ozgun Erten, Dr. Marwa Abdelhakim, Dr. Sherwin Kuo and Dr. Tiffany I. Leung, for providing us with the expert annotation of the data set. Publisher Copyright: © 2021 The Author(s)

PY - 2021/10/1

Y1 - 2021/10/1

AB - The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting, and incomplete context. For example, databases of FDA-approved drug indications used in computational drug repositioning studies do not distinguish between treatments that simply offer symptomatic relief from those that target the underlying pathology. Moreover, drug indication sources often lack proper provenance and have little overlap. Consequently, new predictions can be of poor quality as they offer little in the way of new insights. Hence, work remains to be done to establish higher quality databases of drug indications that are suitable for use in drug discovery and repositioning studies. Here, we report on the combination of weak supervision (i.e., programmatic labeling and crowdsourcing) and deep learning methods for relation extraction from DailyMed text to create a higher quality drug-disease relation dataset. The generated drug-disease relation data shows a high overlap with DrugCentral, a manually curated dataset. Using this dataset, we constructed a machine learning model to classify relations between drugs and diseases from text into four categories: treatment, symptomatic relief, contradiction, and effect, exhibiting an improvement of 15.5% with Bi-LSTM (F1 score of 71.8%) over the best performing discrete method. Access to high quality data is crucial to building accurate and reliable drug repurposing prediction models. Our work suggests how the combination of crowds, experts, and machine learning methods can go hand-in-hand to improve datasets and predictive models.

KW - Drug-disease relation classification

KW - Drug indications

KW - Drug data quality

KW - Drug repositioning

KW - Weak supervision

KW - Programmatic labeling

KW - Crowdsourcing

KW - Human-in-the-loop

KW - Machine learning

U2 - 10.1016/j.jbi.2021.103902

DO - 10.1016/j.jbi.2021.103902

M3 - Article

C2 - 34481057

SN - 1532-0464

VL - 122

JO - Journal of Biomedical Informatics

JF - Journal of Biomedical Informatics

M1 - 103902

ER -