TY - JOUR
T1 - Wikipedia on the CompTox Chemicals Dashboard
T2 - Connecting Resources to Enrich Public Chemical Data
AU - Sinclair, Gabriel
AU - Thillainadarajah, Inthirany
AU - Meyer, Brian
AU - Samano, Vicente
AU - Sivasupramaniam, Sakuntala
AU - Adams, Linda
AU - Willighagen, Egon L
AU - Richard, Ann M
AU - Walker, Martin
AU - Williams, Antony J
N1 - Funding Information:
The authors acknowledge the CompTox Chemicals Dashboard software development team for their dedicated efforts to the development of the Dashboard application as this application provides access to our curation work for the community. The authors appreciate our colleagues Katie Paul-Friedman, Louis Groff, and Mark Strynar for their feedback and comments on the manuscript. The information in this document has been funded wholly or in part by the U.S. Environmental Protection Agency.
Publisher Copyright:
© 2022 American Chemical Society.
PY - 2022/10/24
Y1 - 2022/10/24
N2 - The online encyclopedia Wikipedia aggregates a large amount of data on chemistry, encompassing well over 20,000 individual Wikipedia pages and serves the general public as well as the chemistry community. Many other chemical databases and services utilize these data, and previous projects have focused on methods to index, search, and extract it for review and use. We present a comprehensive effort that combines bulk automated data extraction over tens of thousands of pages, semiautomated data extraction over hundreds of pages, and fine-grained manual extraction of individual lists and compounds of interest. We then correlate these data with the existing contents of the U.S. Environmental Protection Agency's (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database. This was performed with a number of intentions including ensuring as complete a mapping as possible between the Dashboard and Wikipedia so that relevant snippets of the article are loaded for the user to review. Conflicts between Dashboard content and Wikipedia in terms of, for example, identifiers such as chemical registry numbers, names, and InChIs and structure-based collisions such as SMILES were identified and used as the basis of curation of both DSSTox and Wikipedia. This work also allowed us to evaluate available data for sets of chemicals of interest to the Agency, such as synthetic cannabinoids, and expand the content in DSSTox as appropriate. This work also led to improved bidirectional linkage of the detailed chemistry and usage information from Wikipedia with expert-curated structure and identifier data from DSSTox for a new list of nearly 20,000 chemicals. All of this work ultimately enhances the data mappings that allow for the display of the introduction of the Wikipedia article in the community-accessible web-based EPA Comptox Chemicals Dashboard, enhancing the user experience for the thousands of users per day accessing the resource.
AB - The online encyclopedia Wikipedia aggregates a large amount of data on chemistry, encompassing well over 20,000 individual Wikipedia pages and serves the general public as well as the chemistry community. Many other chemical databases and services utilize these data, and previous projects have focused on methods to index, search, and extract it for review and use. We present a comprehensive effort that combines bulk automated data extraction over tens of thousands of pages, semiautomated data extraction over hundreds of pages, and fine-grained manual extraction of individual lists and compounds of interest. We then correlate these data with the existing contents of the U.S. Environmental Protection Agency's (EPA) Distributed Structure-Searchable Toxicity (DSSTox) database. This was performed with a number of intentions including ensuring as complete a mapping as possible between the Dashboard and Wikipedia so that relevant snippets of the article are loaded for the user to review. Conflicts between Dashboard content and Wikipedia in terms of, for example, identifiers such as chemical registry numbers, names, and InChIs and structure-based collisions such as SMILES were identified and used as the basis of curation of both DSSTox and Wikipedia. This work also allowed us to evaluate available data for sets of chemicals of interest to the Agency, such as synthetic cannabinoids, and expand the content in DSSTox as appropriate. This work also led to improved bidirectional linkage of the detailed chemistry and usage information from Wikipedia with expert-curated structure and identifier data from DSSTox for a new list of nearly 20,000 chemicals. All of this work ultimately enhances the data mappings that allow for the display of the introduction of the Wikipedia article in the community-accessible web-based EPA Comptox Chemicals Dashboard, enhancing the user experience for the thousands of users per day accessing the resource.
U2 - 10.1021/acs.jcim.2c00886
DO - 10.1021/acs.jcim.2c00886
M3 - Article
C2 - 36215146
SN - 1549-9596
VL - 62
SP - 4888
EP - 4905
JO - Journal of Chemical Information and Modeling
JF - Journal of Chemical Information and Modeling
IS - 20
ER -