Principal Balances of Compositional Data for Regression and Classification using Partial Least Squares

V. Nesrstová*, Ines Wilms, J. Palarea-Albaladejo, P. Filzmoser, J. A. Martin-Fernández, D. Friedecky, K. Hron

*Corresponding author for this work

Research output: Working paper / Preprint

Abstract

High-dimensional compositional data are commonplace in the modern omics sciences, among other fields. Analysis of compositional data requires a proper choice of orthonormal coordinate representation, as their relative nature is not compatible with the direct use of standard statistical methods. Principal balances, a specific class of log-ratio coordinates, are well suited to this context since they are constructed in such a way that the first few coordinates capture most of the variability in the original data. Focusing on regression and classification problems in high dimensions, we propose a novel Partial Least Squares (PLS) based procedure to construct principal balances that maximize the explained variability of the response variable and notably facilitate interpretability when compared to the ordinary PLS formulation. The proposed PLS principal balance approach can be understood as a generalized version of common logcontrast models, since multiple orthonormal logcontrasts (instead of one) are estimated simultaneously. We demonstrate the performance of the method using both simulated and real data sets.
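
As a reader's aid, the following minimal sketch illustrates what a single balance coordinate is: a normalized logcontrast between two groups of parts. It uses the standard textbook definition of a balance and is not the PLS-based procedure proposed in the paper; the function name `balance_vector` and the example composition are hypothetical.

```python
import numpy as np

def balance_vector(n_parts, num, den):
    """Logcontrast coefficients of the balance between part groups num and den.

    Standard definition (an assumption here, not taken from the paper):
    parts in `num` get +sqrt(s / (r (r + s))), parts in `den` get
    -sqrt(r / (s (r + s))), all other parts get 0, with r = len(num), s = len(den).
    """
    r, s = len(num), len(den)
    v = np.zeros(n_parts)
    v[list(num)] = np.sqrt(s / (r * (r + s)))
    v[list(den)] = -np.sqrt(r / (s * (r + s)))
    return v

# Hypothetical 4-part composition (only ratios between parts matter).
x = np.array([0.1, 0.2, 0.3, 0.4])

# Balance between parts {0, 1} and parts {2, 3}.
v = balance_vector(4, num=[0, 1], den=[2, 3])
b = v @ np.log(x)

# The coefficients sum to zero, so rescaling x leaves the balance unchanged.
b_rescaled = v @ np.log(100.0 * x)
```

The coefficient vector has unit norm and sums to zero, which is what makes it a logcontrast; a set of such vectors that are mutually orthogonal gives an orthonormal coordinate system of the kind the abstract refers to.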
Original language: English
Publisher: Cornell University - arXiv
Number of pages: 26
Publication status: Published - 2022

Publication series

Series: arXiv.org
Number: 2211.01686
ISSN: 2331-8422

Keywords

  • compositional data
  • balance coordinates
  • PLS regression and classification
  • high-dimensional data
  • metabolomic data
