Silhouettes and Quasi Residual Plots for Neural Nets and Tree-based Classifiers

Jakob Raymaekers; Peter J. Rousseeuw

doi:10.1080/10618600.2022.2050249

Jakob Raymaekers, Peter J. Rousseeuw^*

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, that is, the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis. The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on datasets containing images, mixed features, and tweets. Supplementary materials for this article are available online.
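The abstract names the PAC and the silhouette width but does not state their formulas. The following is a minimal sketch under two assumptions that this record does not confirm: that PAC(i) is the posterior probability of the best alternative class divided by the sum of that probability and the posterior of the given class, and that the silhouette width is s(i) = 1 - 2 PAC(i). The function names `pac` and `silhouette_widths` and the toy posterior matrix are hypothetical and serve only as illustration.

```python
import numpy as np


def pac(posteriors, labels):
    """Probability of the alternative class, under the ASSUMED definition
    PAC = p_alt / (p_given + p_alt), where p_alt is the largest posterior
    among all classes other than the given label."""
    posteriors = np.asarray(posteriors, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(len(labels))
    p_given = posteriors[rows, labels]
    # Hide the given class so the row maximum is the best alternative.
    masked = posteriors.copy()
    masked[rows, labels] = -np.inf
    p_alt = masked.max(axis=1)
    # Note: yields nan if both p_given and p_alt are exactly zero.
    return p_alt / (p_given + p_alt)


def silhouette_widths(pac_values):
    """ASSUMED silhouette width s = 1 - 2 * PAC, which lies in [-1, 1]."""
    return 1.0 - 2.0 * np.asarray(pac_values, dtype=float)


# Hypothetical posteriors for three cases and three classes.
probs = np.array([
    [0.80, 0.15, 0.05],  # given class 0, classifier agrees
    [0.20, 0.70, 0.10],  # given class 0, classifier prefers class 1
    [0.30, 0.30, 0.40],  # given class 2, weak agreement
])
given = np.array([0, 0, 2])

pac_values = pac(probs, given)
widths = silhouette_widths(pac_values)
average_width = widths.mean()
```

Under these assumptions, a case whose PAC exceeds 0.5 gets a negative silhouette width, which flags it as a candidate for mislabeling, and `average_width` is the single number one would compare across classifiers fitted to the same dataset.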

Original language	English
Pages (from-to)	1332-1343
Number of pages	12
Journal	Journal of Computational and Graphical Statistics
Volume	31
Issue number	4
Early online date	4 Apr 2022
DOIs	https://doi.org/10.1080/10618600.2022.2050249
Publication status	Published - 2 Oct 2022

Keywords

Image data
Label bias
Mislabeling
Probability of alternative class
Supervised classification
Text analysis

Access to Document

10.1080/10618600.2022.2050249
Licence: CC BY-NC-ND

Cite this

@article{75d8d9cf834e4268bc0f01544abb2d3e,

title = "Silhouettes and Quasi Residual Plots for Neural Nets and Tree-based Classifiers",

abstract = "Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, that is, the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis. The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on datasets containing images, mixed features, and tweets. Supplementary materials for this article are available online.",

keywords = "Image data, Label bias, Mislabeling, Probability of alternative class, Supervised classification, Text analysis",

author = "Raymaekers, Jakob and Rousseeuw, {Peter J.}",

note = "data source: publicly shared datasets for illustration",

year = "2022",

month = oct,

day = "2",

doi = "10.1080/10618600.2022.2050249",

language = "English",

volume = "31",

pages = "1332--1343",

journal = "Journal of Computational and Graphical Statistics",

issn = "1061-8600",

publisher = "Taylor and Francis",

number = "4",

}

TY - JOUR

T1 - Silhouettes and Quasi Residual Plots for Neural Nets and Tree-based Classifiers

AU - Raymaekers, Jakob

AU - Rousseeuw, Peter J.

N1 - data source: publicly shared datasets for illustration

PY - 2022/10/2

Y1 - 2022/10/2

N2 - Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, that is, the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis. The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on datasets containing images, mixed features, and tweets. Supplementary materials for this article are available online.

AB - Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to a different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, that is, the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis. The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on datasets containing images, mixed features, and tweets. Supplementary materials for this article are available online.

KW - Image data

KW - Label bias

KW - Mislabeling

KW - Probability of alternative class

KW - Supervised classification

KW - Text analysis

U2 - 10.1080/10618600.2022.2050249

DO - 10.1080/10618600.2022.2050249

M3 - Article

SN - 1061-8600

VL - 31

SP - 1332

EP - 1343

JO - Journal of Computational and Graphical Statistics

JF - Journal of Computational and Graphical Statistics

IS - 4

ER -