A hybrid approach for robust multilingual toponym extraction and disambiguation

Mena B. Habib, Maurice van Keulen

Research output: Chapter in Book/Report/Conference proceeding > Conference article in proceeding > Academic > peer-review

Abstract

Toponym extraction and disambiguation are key topics recently addressed by the fields of information extraction and geographical information retrieval. The two processes are highly dependent on each other: not only does the effectiveness of toponym extraction affect disambiguation, but disambiguation results may also help improve extraction accuracy. In this paper we propose a hybrid toponym extraction approach based on Hidden Markov Models (HMM) and Support Vector Machines (SVM). The HMM is used for extraction with high recall and low precision. The SVM is then used to find false positives, based on informativeness features and on coherence features derived from the disambiguation results. Experiments conducted on a set of holiday home descriptions, with the aim of extracting and disambiguating toponyms, showed that the proposed approach outperforms state-of-the-art extraction methods and also proved to be robust. Robustness is shown in three respects: language independence, high and low HMM threshold settings, and limited training data.
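
The two-stage idea in the abstract (a high-recall extractor followed by a classifier that removes false positives) can be illustrated with a minimal sketch. This is not the authors' implementation, and every name in it is hypothetical. A capitalization rule stands in for the HMM stage, the two feature values per candidate (an informativeness score and a coherence score) are assumed to be computed elsewhere, and the SVM is a small linear one fitted by subgradient descent on the hinge loss.

```python
from typing import List, Sequence, Tuple


def extract_candidates(tokens: Sequence[str]) -> List[str]:
    """Stand-in for the high-recall HMM stage (assumption, not an HMM):
    keep every capitalized token, accepting many false positives."""
    return [t for t in tokens if t[:1].isupper()]


def train_linear_svm(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    epochs: int = 200,
    lr: float = 0.1,
    lam: float = 0.01,
) -> Tuple[List[float], float]:
    """Fit a linear SVM by subgradient descent on the L2-regularized
    hinge loss. Labels must be +1 (toponym) or -1 (false positive)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # Gradient step for the regularization term.
            w = [wj * (1.0 - lr * lam) for wj in w]
            # Hinge-loss step, only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def filter_false_positives(
    candidates: Sequence[str],
    features: Sequence[Sequence[float]],
    w: Sequence[float],
    b: float,
) -> List[str]:
    """Keep only the candidates that the SVM scores as toponyms."""
    return [
        c
        for c, x in zip(candidates, features)
        if sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0
    ]


# Toy training data: [informativeness, coherence] per candidate (made up).
train_X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]
train_y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(train_X, train_y)

candidates = extract_candidates(["We", "stayed", "near", "Lake", "Garda"])
# Hypothetical feature values for "We", "Lake", "Garda".
features = [[0.1, 0.1], [0.9, 0.9], [0.9, 0.9]]
kept = filter_false_positives(candidates, features, w, b)
```

In this sketch the first stage deliberately over-generates, and the second stage decides using features that only become available after candidates exist, which mirrors how the abstract describes disambiguation results feeding back into extraction.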
Original language: English
Title of host publication: Proceedings of the International Conference on Language Processing and Intelligent Information Systems (LPIIS 2013), Warsaw, Poland
Place of Publication: Berlin
Publisher: Springer Verlag
Publication status: Published - 1 Jun 2013
Externally published: Yes

Publication series

Series: Lecture Notes in Computer Science

Keywords

  • Toponym Extraction
  • Toponyms Disambiguation
  • Hybrid System
  • Multilingual Extraction and Disambiguation
