Abstract
Given a data set of email messages, we are interested in how to resolve aliases and disambiguate authors, even if their names are misspelled, they use completely different email addresses, or they deliberately use aliases. This is done using a combination of string similarity metrics and techniques from authorship attribution and link analysis. These techniques are combined using a voting algorithm based on a Support Vector Machine. The approach is tested on a cleaned subset of the ENRON email data set. The results show that a combination of Jaro-Winkler email address similarity, a Support Vector Machine on writing style attributes, and Jaccard similarity of the link network outperforms the use of each of these techniques separately.
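To make two of the similarity measures named in the abstract concrete, the following is a minimal sketch of Jaro-Winkler string similarity and Jaccard set similarity. It is an illustration under assumptions, not the authors' implementation: the function names, the prefix scale of 0.1, the prefix cap of 4 characters, and the representation of the link network as a set of correspondents per address are all hypothetical choices. The Support Vector Machine voting step is not shown.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix (assumed cap: 4 chars)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scale * (1.0 - base)


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical addresses and contact sets, for illustration only.
    print(jaro_winkler("john.smith@example.com", "jon.smith@example.org"))
    print(jaccard({"alice", "bob", "carol"}, {"bob", "carol", "dave"}))
```

In a setup like the one the abstract describes, scores of this kind would serve as inputs to the combining step rather than as final decisions on their own.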
Original language | English |
---|---|
Publication status | Published - 25 Oct 2012 |
Event | 24th Benelux Conference on Artificial Intelligence - Maastricht, Netherlands; Duration: 25 Oct 2012 → 26 Oct 2012 |
Conference
Conference | 24th Benelux Conference on Artificial Intelligence |
---|---|
Abbreviated title | BNAIC 2012 |
Country/Territory | Netherlands |
City | Maastricht |
Period | 25/10/12 → 26/10/12 |