Robust sparse canonical correlation analysis

doi:10.1186/s12918-016-0317-9

Ines Wilms^*, Christophe Croux

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set having maximal correlation. In genomics, CCA has become increasingly important to estimate the associations between gene expression data and DNA copy number change data. The identification of such associations might help to increase our understanding of the development of diseases such as cancer. However, these data sets are typically high-dimensional, containing many variables relative to the number of objects. Moreover, the data sets might contain atypical observations since it is likely that objects react differently to treatments. We discuss a method for Robust Sparse CCA, thereby providing a solution to both issues. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero. As such, their interpretability is improved. Robust methods can cope with atypical observations in the data.

We illustrate the good performance of the Robust Sparse CCA method by several simulation studies and three biometric examples. Robust Sparse CCA considerably outperforms its main alternatives in (1) correctly detecting the main associations between the data sets, in (2) accurately estimating these associations, and in (3) detecting outliers.

Robust Sparse CCA delivers interpretable canonical vectors, while at the same time coping with outlying observations. The proposed method is able to describe the associations between high-dimensional data sets, which are nowadays commonplace in genomics. Furthermore, the Robust Sparse CCA method makes it possible to characterize outliers.

Original language	English
Article number	72
Number of pages	13
Journal	BMC Systems Biology
Volume	10
DOIs	https://doi.org/10.1186/s12918-016-0317-9
Publication status	Published - 11 Aug 2016
Externally published	Yes

Keywords

Canonical correlation analysis
Penalized estimation
Robust estimation
REGRESSION SHRINKAGE
REGULATORY NETWORKS
CLASSIFICATION
SELECTION
MODEL
LASSO
ASSOCIATION
EFFICIENT
MATRICES
SETS

Access to Document

10.1186/s12918-016-0317-9
Licence: CC BY

Cite this

@article{a20aa8fbdb0d455c8bd3c06302ffa1b9,

title = "Robust sparse canonical correlation analysis",

abstract = "Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set having maximal correlation. In genomics, CCA has become increasingly important to estimate the associations between gene expression data and DNA copy number change data. The identification of such associations might help to increase our understanding of the development of diseases such as cancer. However, these data sets are typically high-dimensional, containing many variables relative to the number of objects. Moreover, the data sets might contain atypical observations since it is likely that objects react differently to treatments. We discuss a method for Robust Sparse CCA, thereby providing a solution to both issues. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero. As such, their interpretability is improved. Robust methods can cope with atypical observations in the data. We illustrate the good performance of the Robust Sparse CCA method by several simulation studies and three biometric examples. Robust Sparse CCA considerably outperforms its main alternatives in (1) correctly detecting the main associations between the data sets, in (2) accurately estimating these associations, and in (3) detecting outliers. Robust Sparse CCA delivers interpretable canonical vectors, while at the same time coping with outlying observations. The proposed method is able to describe the associations between high-dimensional data sets, which are nowadays commonplace in genomics. Furthermore, the Robust Sparse CCA method makes it possible to characterize outliers.",

keywords = "Canonical correlation analysis, Penalized estimation, Robust estimation, REGRESSION SHRINKAGE, REGULATORY NETWORKS, CLASSIFICATION, SELECTION, MODEL, LASSO, ASSOCIATION, EFFICIENT, MATRICES, SETS",

author = "Ines Wilms and Christophe Croux",

note = "data source: Breastcancer data set available in Witten, D., Tibshirani, R. and Gross, S. (2011). Penalized Multivariate Analysis. R package version 1.0.7.1. Available on CRAN (http://cran.r-project.org/web/packages/PMA/index.html). Nutrimouse data available in Gonzalez, I., Dejean, S., Martin, P.G.P., Baccini, A. (2008). CCA: Canonical Correlation Analysis. R package version 1.2. Available on CRAN (https://cran.r-project.org/web/packages/CCA/index.html). Evaporation data available in Freund, R.J. (1979), Multicollinearity etc. some 'new' examples. American Statistical Association Proceedings of Statistical Computing Section, 111-112.",

year = "2016",

month = aug,

day = "11",

doi = "10.1186/s12918-016-0317-9",

language = "English",

volume = "10",

journal = "BMC Systems Biology",

issn = "1752-0509",

publisher = "BioMed Central Ltd",

}

TY - JOUR

T1 - Robust sparse canonical correlation analysis

AU - Wilms, Ines

AU - Croux, Christophe

N1 - data source: Breastcancer data set available in Witten, D., Tibshirani, R. and Gross, S. (2011). Penalized Multivariate Analysis. R package version 1.0.7.1. Available on CRAN (http://cran.r-project.org/web/packages/PMA/index.html). Nutrimouse data available in Gonzalez, I., Dejean, S., Martin, P.G.P., Baccini, A. (2008). CCA: Canonical Correlation Analysis. R package version 1.2. Available on CRAN (https://cran.r-project.org/web/packages/CCA/index.html). Evaporation data available in Freund, R.J. (1979), Multicollinearity etc. some 'new' examples. American Statistical Association Proceedings of Statistical Computing Section, 111-112.

PY - 2016/8/11

Y1 - 2016/8/11

N2 - Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set having maximal correlation. In genomics, CCA has become increasingly important to estimate the associations between gene expression data and DNA copy number change data. The identification of such associations might help to increase our understanding of the development of diseases such as cancer. However, these data sets are typically high-dimensional, containing many variables relative to the number of objects. Moreover, the data sets might contain atypical observations since it is likely that objects react differently to treatments. We discuss a method for Robust Sparse CCA, thereby providing a solution to both issues. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero. As such, their interpretability is improved. Robust methods can cope with atypical observations in the data. We illustrate the good performance of the Robust Sparse CCA method by several simulation studies and three biometric examples. Robust Sparse CCA considerably outperforms its main alternatives in (1) correctly detecting the main associations between the data sets, in (2) accurately estimating these associations, and in (3) detecting outliers. Robust Sparse CCA delivers interpretable canonical vectors, while at the same time coping with outlying observations. The proposed method is able to describe the associations between high-dimensional data sets, which are nowadays commonplace in genomics. Furthermore, the Robust Sparse CCA method makes it possible to characterize outliers.

AB - Canonical correlation analysis (CCA) is a multivariate statistical method that describes the associations between two sets of variables. The objective is to find linear combinations of the variables in each data set having maximal correlation. In genomics, CCA has become increasingly important to estimate the associations between gene expression data and DNA copy number change data. The identification of such associations might help to increase our understanding of the development of diseases such as cancer. However, these data sets are typically high-dimensional, containing many variables relative to the number of objects. Moreover, the data sets might contain atypical observations since it is likely that objects react differently to treatments. We discuss a method for Robust Sparse CCA, thereby providing a solution to both issues. Sparse estimation produces canonical vectors with some of their elements estimated as exactly zero. As such, their interpretability is improved. Robust methods can cope with atypical observations in the data. We illustrate the good performance of the Robust Sparse CCA method by several simulation studies and three biometric examples. Robust Sparse CCA considerably outperforms its main alternatives in (1) correctly detecting the main associations between the data sets, in (2) accurately estimating these associations, and in (3) detecting outliers. Robust Sparse CCA delivers interpretable canonical vectors, while at the same time coping with outlying observations. The proposed method is able to describe the associations between high-dimensional data sets, which are nowadays commonplace in genomics. Furthermore, the Robust Sparse CCA method makes it possible to characterize outliers.

KW - Canonical correlation analysis

KW - Penalized estimation

KW - Robust estimation

KW - REGRESSION SHRINKAGE

KW - REGULATORY NETWORKS

KW - CLASSIFICATION

KW - SELECTION

KW - MODEL

KW - LASSO

KW - ASSOCIATION

KW - EFFICIENT

KW - MATRICES

KW - SETS

U2 - 10.1186/s12918-016-0317-9

DO - 10.1186/s12918-016-0317-9

M3 - Article

C2 - 27516087

SN - 1752-0509

VL - 10

JO - BMC Systems Biology

JF - BMC Systems Biology

M1 - 72

ER -