Fast and optimal algorithm for case-control matching using registry data: application on the antibiotics use of colorectal cancer patients

P. Mamouris; V. Nassiri; G. Molenberghs; M. van den Akker; J. van der Meer; B. Vaes

doi:10.1186/s12874-021-01256-3

P. Mamouris^*, V. Nassiri, G. Molenberghs, M. van den Akker, J. van der Meer, B. Vaes

^*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Background: In case-control studies, most algorithms allow the controls to be sampled several times, which is not always optimal. If many controls are available and adjustment for several covariates is necessary, matching without replacement might increase statistical efficiency. Comparing similar units is of utmost importance in observational data, since confounding and selection bias are present. The aim was twofold: first, to create a method that accommodates the option that a control is not resampled, and second, to display several scenarios that identify changes in Odds Ratios (ORs) while increasing the balance of the matched sample.

Methods: The algorithm was derived iteratively, from the pre-processing steps that derive the data through to its application in a study of antibiotic use and the risk of colorectal cancer in the INTEGO registry (Flanders, Belgium). Different scenarios were developed to investigate the fluctuation of ORs using combinations of exact and varying variables, with or without replacement of controls. To achieve balance in the population, we introduced the Comorbidity Index (CI) variable, which is the sum of chronic diseases, as a means to have comparable units for drawing valid associations.

Results: This algorithm is fast and optimal. We simulated data and demonstrated that the run-time of matching, even with millions of patients, is minimal. It is optimal because the closest control is always captured (using the appropriate ordering and by creating some auxiliary variables), and when a case has only one control, we ensure that this control is matched to this case, thus maximizing the number of cases used in the analysis. In total, 72 different scenarios were displayed, indicating the fluctuation of ORs and revealing patterns, especially a drop when balancing the population.

Conclusions: We created an optimal and computationally efficient algorithm to derive a matched case-control sample with and without replacement of controls. The code and the functions are publicly available as an open-source R package. Finally, we emphasize the importance of displaying several scenarios and assessing the difference in ORs while using an index to balance the population in observational data.
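The matching idea in the abstract can be illustrated with a small sketch. This is not the authors' R package code. It is a simplified greedy 1:1 matcher in Python, written under stated assumptions: the function names (comorbidity_index, match_cases), the parameters (exact_keys, vary_key, caliper, replace) and the field names (sex, age_band, ci) are all hypothetical, and the Comorbidity Index is read here as a count of the chronic diseases present. Cases with the fewest eligible controls are served first, which mirrors the statement that a case with only one control keeps that control. The ordering is a heuristic and does not by itself guarantee an optimal assignment.

```python
from collections import defaultdict


def comorbidity_index(chronic_flags):
    """Count of chronic diseases present (assumed reading of the CI)."""
    return sum(1 for present in chronic_flags if present)


def match_cases(cases, controls, exact_keys, vary_key, caliper, replace=False):
    """Greedy 1:1 matching sketch (hypothetical helper, not the published code).

    cases, controls: lists of dicts, each with an "id" plus covariates.
    exact_keys: covariates that must be equal between case and control.
    vary_key: numeric covariate allowed to differ by at most `caliper`.
    replace: if False, each control is used at most once.
    Returns {case_id: control_id} for the cases that could be matched.
    """
    # Bucket controls by their exact-match covariates.
    strata = defaultdict(list)
    for ctrl in controls:
        strata[tuple(ctrl[k] for k in exact_keys)].append(ctrl)

    # Candidate controls per case, closest first (ties broken by control id).
    candidates = {}
    for case in cases:
        pool = strata.get(tuple(case[k] for k in exact_keys), [])
        close = [c for c in pool if abs(c[vary_key] - case[vary_key]) <= caliper]
        close.sort(key=lambda c: (abs(c[vary_key] - case[vary_key]), c["id"]))
        candidates[case["id"]] = [c["id"] for c in close]

    # Serve the cases with the fewest candidates first, so a case whose only
    # option is a single control does not lose it to a case with more options.
    order = sorted(candidates, key=lambda cid: (len(candidates[cid]), cid))

    used = set()
    matches = {}
    for case_id in order:
        for ctrl_id in candidates[case_id]:
            if replace or ctrl_id not in used:
                matches[case_id] = ctrl_id
                used.add(ctrl_id)
                break
    return matches


# Toy data: "ci" is the comorbidity index, "sex" and "age_band" match exactly.
cases = [
    {"id": "case1", "sex": "F", "age_band": "60-69", "ci": 2},
    {"id": "case2", "sex": "F", "age_band": "60-69", "ci": 4},
]
controls = [
    {"id": "ctrl1", "sex": "F", "age_band": "60-69", "ci": 3},
    {"id": "ctrl2", "sex": "F", "age_band": "60-69", "ci": 1},
    {"id": "ctrl3", "sex": "M", "age_band": "60-69", "ci": 2},
]

matched_without = match_cases(cases, controls, ["sex", "age_band"], "ci", caliper=1)
matched_with = match_cases(cases, controls, ["sex", "age_band"], "ci", caliper=1,
                           replace=True)
print(matched_without)
print(matched_with)
```

In the toy data, case2 has a single eligible control (ctrl1), so it is handled before case1. Without replacement, case1 then falls back to ctrl2 and both cases are retained. With replacement, both cases share ctrl1.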

Original language	English
Article number	62
Number of pages	9
Journal	BMC Medical Research Methodology
Volume	21
Issue number	1
DOIs	https://doi.org/10.1186/s12874-021-01256-3
Publication status	Published - 2 Apr 2021

Keywords

Case-control
Colorectal cancer
Comorbidity index
Optimal matching
BIAS
GENERAL-PRACTICE
RISK
COHORT

Access to Document

10.1186/s12874-021-01256-3 (Licence: CC BY)

Cite this

@article{e193b0d1ef474d3db9c2ac8178c279e8,

title = "Fast and optimal algorithm for case-control matching using registry data: application on the antibiotics use of colorectal cancer patients",

abstract = "Background: In case-control studies, most algorithms allow the controls to be sampled several times, which is not always optimal. If many controls are available and adjustment for several covariates is necessary, matching without replacement might increase statistical efficiency. Comparing similar units is of utmost importance in observational data, since confounding and selection bias are present. The aim was twofold: first, to create a method that accommodates the option that a control is not resampled, and second, to display several scenarios that identify changes in Odds Ratios (ORs) while increasing the balance of the matched sample. Methods: The algorithm was derived iteratively, from the pre-processing steps that derive the data through to its application in a study of antibiotic use and the risk of colorectal cancer in the INTEGO registry (Flanders, Belgium). Different scenarios were developed to investigate the fluctuation of ORs using combinations of exact and varying variables, with or without replacement of controls. To achieve balance in the population, we introduced the Comorbidity Index (CI) variable, which is the sum of chronic diseases, as a means to have comparable units for drawing valid associations. Results: This algorithm is fast and optimal. We simulated data and demonstrated that the run-time of matching, even with millions of patients, is minimal. It is optimal because the closest control is always captured (using the appropriate ordering and by creating some auxiliary variables), and when a case has only one control, we ensure that this control is matched to this case, thus maximizing the number of cases used in the analysis. In total, 72 different scenarios were displayed, indicating the fluctuation of ORs and revealing patterns, especially a drop when balancing the population. Conclusions: We created an optimal and computationally efficient algorithm to derive a matched case-control sample with and without replacement of controls. The code and the functions are publicly available as an open-source R package. Finally, we emphasize the importance of displaying several scenarios and assessing the difference in ORs while using an index to balance the population in observational data.",

keywords = "Case-control, Colorectal cancer, Comorbidity index, Optimal matching, BIAS, GENERAL-PRACTICE, RISK, COHORT",

author = "Mamouris, P. and Nassiri, V. and Molenberghs, G. and {van den Akker}, M. and {van der Meer}, J. and Vaes, B.",

year = "2021",

month = apr,

day = "2",

doi = "10.1186/s12874-021-01256-3",

language = "English",

volume = "21",

journal = "BMC Medical Research Methodology",

issn = "1471-2288",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Fast and optimal algorithm for case-control matching using registry data: application on the antibiotics use of colorectal cancer patients

AU - Mamouris, P.

AU - Nassiri, V.

AU - Molenberghs, G.

AU - van den Akker, M.

AU - van der Meer, J.

AU - Vaes, B.

PY - 2021/4/2

Y1 - 2021/4/2

N2 - Background: In case-control studies, most algorithms allow the controls to be sampled several times, which is not always optimal. If many controls are available and adjustment for several covariates is necessary, matching without replacement might increase statistical efficiency. Comparing similar units is of utmost importance in observational data, since confounding and selection bias are present. The aim was twofold: first, to create a method that accommodates the option that a control is not resampled, and second, to display several scenarios that identify changes in Odds Ratios (ORs) while increasing the balance of the matched sample. Methods: The algorithm was derived iteratively, from the pre-processing steps that derive the data through to its application in a study of antibiotic use and the risk of colorectal cancer in the INTEGO registry (Flanders, Belgium). Different scenarios were developed to investigate the fluctuation of ORs using combinations of exact and varying variables, with or without replacement of controls. To achieve balance in the population, we introduced the Comorbidity Index (CI) variable, which is the sum of chronic diseases, as a means to have comparable units for drawing valid associations. Results: This algorithm is fast and optimal. We simulated data and demonstrated that the run-time of matching, even with millions of patients, is minimal. It is optimal because the closest control is always captured (using the appropriate ordering and by creating some auxiliary variables), and when a case has only one control, we ensure that this control is matched to this case, thus maximizing the number of cases used in the analysis. In total, 72 different scenarios were displayed, indicating the fluctuation of ORs and revealing patterns, especially a drop when balancing the population. Conclusions: We created an optimal and computationally efficient algorithm to derive a matched case-control sample with and without replacement of controls. The code and the functions are publicly available as an open-source R package. Finally, we emphasize the importance of displaying several scenarios and assessing the difference in ORs while using an index to balance the population in observational data.

KW - Case-control

KW - Colorectal cancer

KW - Comorbidity index

KW - Optimal matching

KW - BIAS

KW - GENERAL-PRACTICE

KW - RISK

KW - COHORT

U2 - 10.1186/s12874-021-01256-3

DO - 10.1186/s12874-021-01256-3

M3 - Article

C2 - 33810785

SN - 1471-2288

VL - 21

JO - BMC Medical Research Methodology

JF - BMC Medical Research Methodology

IS - 1

M1 - 62

ER -