TY - CONF
T1 - Test-driven Evaluation of Linked Data Quality
AU - Kontokostas, Dimitris
AU - Westphal, Patrick
AU - Auer, Soeren
AU - Hellmann, Sebastian
AU - Lehmann, Jens
AU - Cornelissen, Roland
AU - Zaveri, Amrapali
PY - 2014
Y1 - 2014
N2 - Linked Open Data (LOD) comprises an unprecedented volume of structured data on the Web. However, these datasets are of varying quality, ranging from extensively curated datasets to crowd-sourced or extracted data of often relatively low quality. We present a methodology for test-driven quality assessment of Linked Data, which is inspired by test-driven software development. We argue that vocabularies, ontologies and knowledge bases should be accompanied by a number of test cases, which help to ensure a basic level of quality. We present a methodology for assessing the quality of linked data resources, based on a formalization of bad smells and data quality problems. Our formalization employs SPARQL query templates, which are instantiated into concrete quality test case queries. Based on an extensive survey, we compile a comprehensive library of data quality test case patterns. We perform automatic test case instantiation based on schema constraints or semi-automatically enriched schemata, and allow the user to generate specific test case instantiations that are applicable to a schema or dataset. We provide an extensive evaluation of five LOD datasets, manual test case instantiation for five schemas, and automatic test case instantiations for all available schemata registered with LOV. One of the main advantages of our approach is that domain-specific semantics can be encoded in the data quality test cases, making it possible to discover data quality problems beyond conventional quality heuristics.
KW - 2014 group_aksw dllearner MOLE sys:relevantFor:infai sys:relevantFor:bis sys:relevantFor:lod2 sys:relevantFor:geoknow topic_QualityAnalysis lod2page lehmann kontokostas rdfunit dataquality westphal
DO - 10.1145/2566486.2568002
M3 - Conference paper in proceedings
SN - 978-1-4503-2744-2
T3 - WWW '14
SP - 747
EP - 758
BT - Proceedings of the 23rd International Conference on World Wide Web
PB - International World Wide Web Conferences Steering Committee
ER -