Wikidata subsetting: Approaches, tools, and evaluation

Seyed Amir Hosseini Beghaeiraveri^*, Jose Emilio Labra Gayo, Andra Waagmeester, Ammar Ammar, Carolina Gonzalez, Denise Slenter, Sabah Ul-Hasan, Egon Willighagen, Fiona McNeill, Alasdair J.G. Gray, Lucie-Aimée Kaffee (Editor), Simon Razniewski (Editor), Pavlos Vougiouklis (Editor)

doi:10.3233/SW-233491

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.

Original language	English
Pages (from-to)	1-27
Number of pages	27
Journal	Semantic Web
DOIs	https://doi.org/10.3233/SW-233491
Publication status	E-pub ahead of print - 27 Dec 2023

Keywords

wikidata
subsetting

Access to Document

10.3233/SW-233491 (Licence: Free access - publisher)

https://www.medra.org/servlet/aliasResolver?alias=iospress&doi=10.3233/SW-233491

Cite this

Hosseini Beghaeiraveri, S. A., Labra Gayo, J. E., Waagmeester, A., Ammar, A., Gonzalez, C., Slenter, D., Ul-Hasan, S., Willighagen, E., McNeill, F., Gray, A. J. G., Kaffee, L.-A. (Ed.), Razniewski, S. (Ed.), & Vougiouklis, P. (Ed.) (2023). Wikidata subsetting: Approaches, tools, and evaluation. Semantic Web, 1-27. Advance online publication. https://doi.org/10.3233/SW-233491

@article{043415644ec44f97b471eacb7d627281,

title = "Wikidata subsetting: Approaches, tools, and evaluation",

abstract = "Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.",

keywords = "wikidata, subsetting",

author = "{Hosseini Beghaeiraveri}, {Seyed Amir} and {Labra Gayo}, {Jose Emilio} and Andra Waagmeester and Ammar Ammar and Carolina Gonzalez and Denise Slenter and Sabah Ul-Hasan and Egon Willighagen and Fiona McNeill and Gray, {Alasdair J.G.} and Lucie-Aim{\'e}e Kaffee and Simon Razniewski and Pavlos Vougiouklis",

year = "2023",

month = dec,

day = "27",

doi = "10.3233/SW-233491",

language = "English",

pages = "1--27",

journal = "Semantic Web",

issn = "1570-0844",

publisher = "IOS Press",

}

TY - JOUR

T1 - Wikidata subsetting

T2 - Approaches, tools, and evaluation

AU - Hosseini Beghaeiraveri, Seyed Amir

AU - Labra Gayo, Jose Emilio

AU - Waagmeester, Andra

AU - Ammar, Ammar

AU - Gonzalez, Carolina

AU - Slenter, Denise

AU - Ul-Hasan, Sabah

AU - Willighagen, Egon

AU - McNeill, Fiona

AU - Gray, Alasdair J.G.

A2 - Kaffee, Lucie-Aimée

A2 - Razniewski, Simon

A2 - Vougiouklis, Pavlos

PY - 2023/12/27

Y1 - 2023/12/27

N2 - Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.

AB - Wikidata is a massive Knowledge Graph (KG), including more than 100 million data items and nearly 1.5 billion statements, covering a wide range of topics such as geography, history, scholarly articles, and life science data. The large volume of Wikidata is difficult to handle for research purposes; many researchers cannot afford the costs of hosting 100 GB of data. While Wikidata provides a public SPARQL endpoint, it can only be used for short-running queries. Often, researchers only require a limited range of data from Wikidata focusing on a particular topic for their use case. Subsetting is the process of defining and extracting the required data range from the KG; this process has received increasing attention in recent years. Specific tools and several approaches have been developed for subsetting, which have not been evaluated yet. In this paper, we survey the available subsetting approaches, introducing their general strengths and weaknesses, and evaluate four practical tools specific for Wikidata subsetting – WDSub, KGTK, WDumper, and WDF – in terms of execution performance, extraction accuracy, and flexibility in defining the subsets. Results show that all four tools have a minimum of 99.96% accuracy in extracting defined items and 99.25% in extracting statements. The fastest tool in extraction is WDF, while the most flexible tool is WDSub. During the experiments, multiple subset use cases have been defined and the extracted subsets have been analyzed, obtaining valuable information about the variety and quality of Wikidata, which would otherwise not be possible through the public Wikidata SPARQL endpoint.

KW - wikidata

KW - subsetting

U2 - 10.3233/SW-233491

DO - 10.3233/SW-233491

M3 - Article

SN - 1570-0844

SP - 1

EP - 27

JO - Semantic Web

JF - Semantic Web

ER -