A machine learning approach to detecting fraudulent job types

Marcel Naudé; Kolawole John Adebayo; Rohan Nanda

doi:10.1007/s00146-022-01469-0

Corresponding author: Marcel Naudé

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Job seekers find themselves increasingly duped and misled by fraudulent job advertisements, posing a threat to their privacy, security and well-being. There is a clear need for solutions that can protect innocent job seekers. Existing approaches to detecting fraudulent jobs do not scale well, function like a black box, and lack interpretability, which is essential to guide applicants’ decision-making. Moreover, commonly used lexical features may be insufficient as the representation does not capture contextual semantics of the underlying document. Hence, this paper explores to what extent different categorizations of fraudulent jobs can be classified. In addition, this paper seeks to find which types of features are most relevant in classifying the type of fraudulent job. In this paper, we develop and validate a machine learning system for identifying identity theft, corporate identity theft and multi-level marketing amongst fraudulent job advertisements. We utilized four classes of features: empirical rule-set-based features, bag-of-words models, recent state-of-the-art word embeddings, and transformer models for various machine learning classifiers. The machine learning models were validated by evaluating them on a publicly available job description dataset. Our results indicate that the word embeddings and transformer-based features consistently outperformed the handcrafted rule-set-based feature class. Ultimately, a Gradient Boosting classifier with a combination of empirical rule-set-based features, part-of-speech tags and bag-of-words vectors achieved the best performance with an F1-score of 0.88.

Original language	English
Pages (from-to)	1013-1024
Number of pages	12
Journal	AI and Society
Volume	38
Issue number	2
Early online date	22 May 2022
DOIs	https://doi.org/10.1007/s00146-022-01469-0
Publication status	Published - Apr 2023

Keywords

Fraud detection
Machine learning
Natural language processing
Online recruitment fraud
Transformers
word2vec

Access to Document

DOI: 10.1007/s00146-022-01469-0
Licence: CC BY

Cite this

@article{2d39e44bc8744558a55f610cc74c2cd7,
  title = "A machine learning approach to detecting fraudulent job types",
  abstract = "Job seekers find themselves increasingly duped and misled by fraudulent job advertisements, posing a threat to their privacy, security and well-being. There is a clear need for solutions that can protect innocent job seekers. Existing approaches to detecting fraudulent jobs do not scale well, function like a black-box, and lack interpretability, which is essential to guide applicants{\textquoteright} decision-making. Moreover, commonly used lexical features may be insufficient as the representation does not capture contextual semantics of the underlying document. Hence, this paper explores to what extent different categorizations of fraudulent jobs can be classified. In addition, this paper seeks to find what type of features are most relevant in classifying the type of fraudulent job. In this paper, we develop and validate a machine learning system for identifying identity theft, corporate identity theft and multi-level marketing amongst fraudulent job advertisements. We utilized four classes of features: empirical rule set-based features, bag-of-word models, most recent state-of-the-art word embeddings and transformer models for various machine learning classifiers. The machine learning models were validated by evaluating them on a publicly available job description dataset. Our results indicate that the word embeddings and transformer-based features consistently outperformed the handcrafted rule-set based features class. Ultimately, a Gradient Boosting classifier with a combination of empirical rule-set based features, parts-of-speech tags and bag-of-words vectors achieved the best performance with an F1-score of 0.88.",
  keywords = "Fraud detection, Machine learning, Natural language processing, Online recruitment fraud, Transformers, word2vec",
  author = "Naud{\'e}, Marcel and Adebayo, Kolawole John and Nanda, Rohan",
  note = "Funding Information: Kolawole Adebayo has received funding from Enterprise Ireland{\textquoteright}s CareerFit-Plus Co-fund and the European Union{\textquoteright}s Horizon 2020 research and innovation programme under the Marie Sk{\l}odowska-Curie grant agreement No. 847402. Publisher Copyright: {\textcopyright} 2022, The Author(s).",
  year = "2023",
  month = apr,
  doi = "10.1007/s00146-022-01469-0",
  language = "English",
  volume = "38",
  pages = "1013--1024",
  journal = "AI and Society",
  issn = "0951-5666",
  publisher = "Springer",
  number = "2",
}

TY  - JOUR
T1  - A machine learning approach to detecting fraudulent job types
AU  - Naudé, Marcel
AU  - Adebayo, Kolawole John
AU  - Nanda, Rohan
N1  - Funding Information: Kolawole Adebayo has received funding from Enterprise Ireland’s CareerFit-Plus Co-fund and the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 847402. Publisher Copyright: © 2022, The Author(s).
PY  - 2023/4
Y1  - 2023/4
AB  - Job seekers find themselves increasingly duped and misled by fraudulent job advertisements, posing a threat to their privacy, security and well-being. There is a clear need for solutions that can protect innocent job seekers. Existing approaches to detecting fraudulent jobs do not scale well, function like a black-box, and lack interpretability, which is essential to guide applicants’ decision-making. Moreover, commonly used lexical features may be insufficient as the representation does not capture contextual semantics of the underlying document. Hence, this paper explores to what extent different categorizations of fraudulent jobs can be classified. In addition, this paper seeks to find what type of features are most relevant in classifying the type of fraudulent job. In this paper, we develop and validate a machine learning system for identifying identity theft, corporate identity theft and multi-level marketing amongst fraudulent job advertisements. We utilized four classes of features: empirical rule set-based features, bag-of-word models, most recent state-of-the-art word embeddings and transformer models for various machine learning classifiers. The machine learning models were validated by evaluating them on a publicly available job description dataset. Our results indicate that the word embeddings and transformer-based features consistently outperformed the handcrafted rule-set based features class. Ultimately, a Gradient Boosting classifier with a combination of empirical rule-set based features, parts-of-speech tags and bag-of-words vectors achieved the best performance with an F1-score of 0.88.
KW  - Fraud detection
KW  - Machine learning
KW  - Natural language processing
KW  - Online recruitment fraud
KW  - Transformers
KW  - word2vec
U2  - 10.1007/s00146-022-01469-0
DO  - 10.1007/s00146-022-01469-0
M3  - Article
SN  - 0951-5666
VL  - 38
SP  - 1013
EP  - 1024
JO  - AI and Society
JF  - AI and Society
IS  - 2
ER  - 