Transforming variables to central normality

Jakob Raymaekers, Peter J. Rousseeuw*

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Many real data sets contain numerical features (variables) whose distribution is far from normal (Gaussian). Instead, their distribution is often skewed. In order to handle such data it is customary to preprocess the variables to make them more normal. The Box-Cox and Yeo-Johnson transformations are well-known tools for this. However, the standard maximum likelihood estimator of their transformation parameter is highly sensitive to outliers, and will often try to move outliers inward at the expense of the normality of the central part of the data. We propose a modification of these transformations as well as an estimator of the transformation parameter that is robust to outliers, so the transformed data can be approximately normal in the center and a few outliers may deviate from it. It compares favorably to existing techniques in an extensive simulation study and on real data.
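
The following is a minimal sketch of the classical, non-robust approach the abstract refers to: maximum-likelihood estimation of the Yeo-Johnson transformation parameter, here via scipy.stats.yeojohnson. The data and contamination values are illustrative assumptions; this is not the robust estimator proposed in the article, only a demonstration of the outlier sensitivity it addresses.

```python
# Sketch of the standard (non-robust) Yeo-Johnson transformation with
# maximum-likelihood estimation of lambda, illustrating how a few outliers
# can pull the estimated parameter away from the value chosen for clean data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)   # skewed but clean data
x_out = np.concatenate([x, [50.0, 60.0]])            # same data plus two outliers

# scipy.stats.yeojohnson returns (transformed data, ML estimate of lambda)
_, lam_clean = stats.yeojohnson(x)
_, lam_out = stats.yeojohnson(x_out)

print(f"lambda without outliers: {lam_clean:.3f}")
print(f"lambda with outliers:    {lam_out:.3f}")  # typically shifted by the outliers
```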

Original language: English
Pages (from-to): 4953–4975
Number of pages: 23
Journal: Machine Learning
Volume: 113
Issue number: 8
Early online date: 21 Mar 2021
Publication status: Published - Aug 2024
Externally published: Yes

Keywords

  • Anomaly detection
  • Data preprocessing
  • Feature transformation
  • Outliers
  • Symmetrization
