Entity resolution in disjoint graphs: an application on genealogical data

Hossein Rahmani*, Bijan Ranjbarsahraei, Gerhard Weiss, Karl Tuyls

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Entity Resolution (ER) is the process of identifying references that refer to the same entity across one or more data sources. Most existing approaches either exploit the content information of references (content-based ER) or additionally consider linkage information among references (context-based ER). However, in new applications of ER, such as the genealogical domain, the very limited linkage information among references results in a disjoint graph, on which existing content- and context-based ER techniques have very limited applicability. Therefore, in this paper we propose, first, to use the homophily principle to augment the original input graph by connecting potentially similar references and, second, to use a Random Walk based approach to take into account the contextual information available for each reference in the augmented graph. We evaluate the proposed method by applying it to a large genealogical dataset: we succeed in predicting 420,000 reference matches with a precision of 92% and discover six novel and informative patterns among them that cannot be detected in the original disjoint graph.
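The two steps named in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the random-walk variant, so the sketch assumes name similarity via the standard library's `difflib.SequenceMatcher` as a stand-in for homophily, and a random walk with restart as one common variant. The function names `augment` and `random_walk_with_restart`, the `threshold` and `restart` parameters, and the toy records are all hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations


def augment(graph, names, threshold=0.8):
    """Copy `graph` and add an edge between every pair of references
    whose names are at least `threshold` similar (assumed homophily proxy)."""
    augmented = {node: set(nbrs) for node, nbrs in graph.items()}
    for a, b in combinations(sorted(names), 2):
        if SequenceMatcher(None, names[a], names[b]).ratio() >= threshold:
            augmented[a].add(b)
            augmented[b].add(a)
    return augmented


def random_walk_with_restart(graph, start, restart=0.15, iterations=50):
    """Approximate the visiting probabilities of a walker that starts at
    `start` and jumps back to it with probability `restart` at each step."""
    scores = {node: 0.0 for node in graph}
    scores[start] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        for node, mass in scores.items():
            nbrs = graph[node]
            if not nbrs:
                # Isolated node: all of its probability mass returns to the start.
                nxt[start] += mass
                continue
            nxt[start] += restart * mass
            share = (1.0 - restart) * mass / len(nbrs)
            for nbr in nbrs:
                nxt[nbr] += share
        scores = nxt
    return scores


# Toy data (hypothetical): two family fragments with no link between them.
names = {
    "r1": "Jan Jansen",
    "r2": "Maria de Vries",
    "r3": "Jan Janssen",
    "r4": "Maria de Vries",
    "r5": "Piet Bakker",
}
graph = {
    "r1": {"r2"},
    "r2": {"r1"},
    "r3": {"r4"},
    "r4": {"r3"},
    "r5": set(),
}

augmented = augment(graph, names)
before = random_walk_with_restart(graph, "r1")
after = random_walk_with_restart(augmented, "r1")
```

In the original disjoint graph, a walk started at `r1` can never reach `r3`, so `before["r3"]` stays at zero. After augmentation the similar-name edges bridge the two fragments, and `after["r3"]` becomes positive, which is the kind of contextual signal the abstract says is unavailable without augmentation.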
Original language: English
Pages (from-to): 455-475
Number of pages: 21
Journal: Intelligent Data Analysis
Volume: 20
Issue number: 2
Publication status: Published - 2016

Keywords

  • Entity resolution
  • Disjoint graphs
  • Genealogy
  • Record linkage
  • Networks
