Double machine learning and automated confounder selection: A cautionary tale

Paul Hunermund; Beyers Louw; Itamar Caspi

doi:10.1515/jci-2022-0078

Paul Hunermund^*, Beyers Louw, Itamar Caspi

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Double machine learning (DML) has become an increasingly popular tool for automated variable selection in high-dimensional settings. Even though the ability to deal with a large number of potential covariates can render selection-on-observables assumptions more plausible, there is at the same time a growing risk that endogenous variables are included, which would lead to the violation of conditional independence. This article demonstrates that DML is very sensitive to the inclusion of only a few "bad controls" in the covariate space. The resulting bias varies with the nature of the theoretical causal model, which raises concerns about the feasibility of selecting control variables in a data-driven way.
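The sensitivity to a single "bad control" described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code or experimental design. It assumes a hypothetical linear data-generating process with one confounder `x` and one collider `c`, and it replaces DML's machine-learning nuisance estimators with plain OLS, without cross-fitting, so it only mimics the partialling-out step. All variable and function names are made up for the example.

```python
import numpy as np


def partial_out(target, controls):
    """Return residuals of `target` after an OLS fit on `controls` plus an intercept."""
    design = np.column_stack([np.ones(len(target)), controls])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef


def partialling_out_estimate(y, d, controls):
    """Regress residualised outcome on residualised treatment (OLS stand-in for DML)."""
    y_res = partial_out(y, controls)
    d_res = partial_out(d, controls)
    return float(d_res @ y_res / (d_res @ d_res))


# Hypothetical data-generating process (an assumption, not taken from the article).
rng = np.random.default_rng(0)
n = 20_000
theta = 1.0                               # true causal effect of d on y
x = rng.normal(size=n)                    # confounder: affects both d and y
d = x + rng.normal(size=n)                # treatment
y = theta * d + x + rng.normal(size=n)    # outcome
c = d + y + rng.normal(size=n)            # collider: caused by both d and y

# Good control set: the confounder only.
theta_good = partialling_out_estimate(y, d, x.reshape(-1, 1))

# Bad control set: the confounder plus the collider.
theta_bad = partialling_out_estimate(y, d, np.column_stack([x, c]))

print(f"true effect:            {theta:.3f}")
print(f"controls = [x]:         {theta_good:.3f}")
print(f"controls = [x, c]:      {theta_bad:.3f}")
```

With the confounder alone, the estimate should land close to the true effect. Adding the single collider moves it far away: in this particular design the population coefficient on `d` is 0 once `c` is controlled for. Other causal structures would produce bias of a different size and sign.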

Original language	English
Article number	20220078
Number of pages	12
Journal	Journal of Causal Inference
Volume	11
Issue number	1
DOIs	https://doi.org/10.1515/jci-2022-0078
Publication status	Published - 23 May 2023

Keywords

double/debiased machine learning
bad controls
backdoor adjustment
collider bias
causal hierarchy

Access to Document

10.1515/jci-2022-0078
Licence: CC BY

Cite this

@article{f7304e2a9cfc4acd918feeba36e32de7,

title = "Double machine learning and automated confounder selection: A cautionary tale",

abstract = "Double machine learning (DML) has become an increasingly popular tool for automated variable selection in high-dimensional settings. Even though the ability to deal with a large number of potential covariates can render selection-on-observables assumptions more plausible, there is at the same time a growing risk that endogenous variables are included, which would lead to the violation of conditional independence. This article demonstrates that DML is very sensitive to the inclusion of only a few {"}bad controls{"} in the covariate space. The resulting bias varies with the nature of the theoretical causal model, which raises concerns about the feasibility of selecting control variables in a data-driven way.",

keywords = "double/debiased machine learning, bad controls, backdoor adjustment, collider bias, causal hierarchy",

author = "Paul Hunermund and Beyers Louw and Itamar Caspi",

year = "2023",

month = may,

day = "23",

doi = "10.1515/jci-2022-0078",

language = "English",

volume = "11",

journal = "Journal of Causal Inference",

issn = "2193-3677",

publisher = "De Gruyter",

number = "1",

}

TY - JOUR

T1 - Double machine learning and automated confounder selection: A cautionary tale

AU - Hunermund, Paul

AU - Louw, Beyers

AU - Caspi, Itamar

PY - 2023/5/23

Y1 - 2023/5/23

N2 - Double machine learning (DML) has become an increasingly popular tool for automated variable selection in high-dimensional settings. Even though the ability to deal with a large number of potential covariates can render selection-on-observables assumptions more plausible, there is at the same time a growing risk that endogenous variables are included, which would lead to the violation of conditional independence. This article demonstrates that DML is very sensitive to the inclusion of only a few "bad controls" in the covariate space. The resulting bias varies with the nature of the theoretical causal model, which raises concerns about the feasibility of selecting control variables in a data-driven way.

KW - double/debiased machine learning

KW - bad controls

KW - backdoor adjustment

KW - collider bias

KW - causal hierarchy

U2 - 10.1515/jci-2022-0078

DO - 10.1515/jci-2022-0078

M3 - Article

SN - 2193-3677

VL - 11

JO - Journal of Causal Inference

JF - Journal of Causal Inference

IS - 1

M1 - 20220078

ER -