TY - JOUR
T1 - Relation extraction from DailyMed structured product labels by optimally combining crowd, experts and machines
AU - Shingjergji, K.
AU - Celebi, R.
AU - Scholtes, J.
AU - Dumontier, M.
N1 - Funding Information:
We thank Dr. Ozgun Erten, Dr. Marwa Abdelhakim, Dr. Sherwin Kuo, and Dr. Tiffany I. Leung for providing us with the expert annotation of the data set.
Publisher Copyright:
© 2021 The Author(s)
PY - 2021/10/1
Y1 - 2021/10/1
N2 - The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting, and incomplete context. For example, databases of FDA-approved drug indications used in computational drug repositioning studies do not distinguish treatments that simply offer symptomatic relief from those that target the underlying pathology. Moreover, drug indication sources often lack proper provenance and have little overlap. Consequently, new predictions can be of poor quality as they offer little in the way of new insights. Hence, work remains to be done to establish higher quality databases of drug indications that are suitable for use in drug discovery and repositioning studies. Here, we report on the combination of weak supervision (i.e., programmatic labeling and crowdsourcing) and deep learning methods for relation extraction from DailyMed text to create a higher quality drug-disease relation dataset. The generated drug-disease relation data shows a high overlap with DrugCentral, a manually curated dataset. Using this dataset, we constructed a machine learning model to classify relations between drugs and diseases from text into four categories: treatment, symptomatic relief, contradiction, and effect, exhibiting an improvement of 15.5% with Bi-LSTM (F1 score of 71.8%) over the best performing discrete method. Access to high quality data is crucial to building accurate and reliable drug repurposing prediction models. Our work suggests how the combination of crowds, experts, and machine learning methods can go hand-in-hand to improve datasets and predictive models.
AB - The effectiveness of machine learning models to provide accurate and consistent results in drug discovery and clinical decision support is strongly dependent on the quality of the data used. However, substantive amounts of open data that drive drug discovery suffer from a number of issues including inconsistent representation, inaccurate reporting, and incomplete context. For example, databases of FDA-approved drug indications used in computational drug repositioning studies do not distinguish treatments that simply offer symptomatic relief from those that target the underlying pathology. Moreover, drug indication sources often lack proper provenance and have little overlap. Consequently, new predictions can be of poor quality as they offer little in the way of new insights. Hence, work remains to be done to establish higher quality databases of drug indications that are suitable for use in drug discovery and repositioning studies. Here, we report on the combination of weak supervision (i.e., programmatic labeling and crowdsourcing) and deep learning methods for relation extraction from DailyMed text to create a higher quality drug-disease relation dataset. The generated drug-disease relation data shows a high overlap with DrugCentral, a manually curated dataset. Using this dataset, we constructed a machine learning model to classify relations between drugs and diseases from text into four categories: treatment, symptomatic relief, contradiction, and effect, exhibiting an improvement of 15.5% with Bi-LSTM (F1 score of 71.8%) over the best performing discrete method. Access to high quality data is crucial to building accurate and reliable drug repurposing prediction models. Our work suggests how the combination of crowds, experts, and machine learning methods can go hand-in-hand to improve datasets and predictive models.
KW - Drug-disease relation classification
KW - Drug indications
KW - Drug data quality
KW - Drug repositioning
KW - Weak supervision
KW - Programmatic labeling
KW - Crowdsourcing
KW - Human-in-the-loop
KW - Machine learning
U2 - 10.1016/j.jbi.2021.103902
DO - 10.1016/j.jbi.2021.103902
M3 - Article
C2 - 34481057
SN - 1532-0464
VL - 122
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 103902
ER -