TY - GEN
T1 - A linked data representation for summary statistics and grouping criteria
AU - McCusker, James P.
AU - Dumontier, Michel
AU - Chari, Shruthi
AU - Luciano, Joanne S.
AU - McGuinness, Deborah L.
N1 - Funding Information:
Thank you to James Michaelis and John Erickson for feedback and examples. This work is supported by IBM Research AI through the AI Horizons Network.
Publisher Copyright:
Copyright © 2019 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
PY - 2019/1/1
Y1 - 2019/1/1
AB - Summary statistics are fundamental to data science and are the building blocks of statistical reasoning. Most of the data and statistics made available on government web sites are aggregate; however, until now, we have not had a suitable linked data representation available. We propose a way to express summary statistics across aggregate groups as linked data using Web Ontology Language (OWL) class-based sets, where members of the set contribute to the overall aggregate value. Additionally, many clinical studies in the biomedical field rely on demographic summaries of their study cohorts and the patients assigned to each arm. While most data query languages, including SPARQL, allow for computation of summary statistics, they do not provide a way to integrate those values back into the RDF graphs they were computed from. We represent this knowledge, which would otherwise be lost, through the use of OWL 2 punning semantics, the expression of aggregate grouping criteria as OWL classes with variables, and constructs from the Semanticscience Integrated Ontology (SIO) and the World Wide Web Consortium’s provenance ontology, PROV-O, providing interoperable representations that are well supported across the web of Linked Data. We evaluate these semantics using a Resource Description Framework (RDF) representation of patient case information from the Genomic Data Commons, a data portal from the National Cancer Institute.
KW - Data Exploration
KW - Data Science
KW - Interoperability
KW - Knowledge Representation
KW - Linked Data
KW - Provenance
KW - Summary Statistics
KW - Transparency
M3 - Conference article in proceeding
VL - 2549
T3 - CEUR Workshop Proceedings
BT - Computer Vision Winter Workshop 2023
T2 - 2019 Joint International Workshops on Sensors and Actuators on the Web, and Semantic Statistics
Y2 - 27 October 2019
ER -