TY - JOUR
T1 - CTGAN-driven synthetic data generation
T2 - A multidisciplinary, expert-guided approach (TIMA)
AU - Parise, Orlando
AU - Kronenberger, Rani
AU - Parise, Gianmarco
AU - de Asmundis, Carlo
AU - Gelsomino, Sandro
AU - La Meir, Mark
PY - 2025/2
Y1 - 2025/2
N2 - OBJECTIVE: We generated synthetic data starting from a population of two hundred thirty-eight adults SARS-CoV-2 positive patients admitted to the University Hospital of Brussels, Belgium, in 2020, utilizing a Conditional Tabular Generative Adversarial Network (CTGAN)-based technique with the aim of testing the performance, representativeness, realism, novelty, and diversity of synthetic data generated from a small patient sample. A Multidisciplinary Approach (TIMA) incorporates active participation from a medical team throughout the various stages of this process. METHODS: The TIMA committee scrutinized data for inconsistencies, implementing stringent rules for variables unlearned by the system. A sensitivity analysis determined 100,000 epochs, leading to the generation of 10,000 synthetic data. The model's performance was tested using a general-purpose dataset, comparing real and synthetic data. RESULTS: Outcomes indicate the robustness of our model, with an average contingency score of 0.94 across variable pairs in synthetic and real data. Continuous variables exhibited a median correlation similarity score of 0.97. Novelty received a top score of 1. Principal Component Analysis (PCA) on synthetic values demonstrated diversity, as no patient pair displayed a zero or close-to-zero value distance. Remarkably, the TIMA committee's evaluation revealed that synthetic data was recognized as authentic by nearly 100%. CONCLUSIONS: Our trained model exhibited commendable performance, yielding high representativeness in the synthetic dataset compared to the original. The synthetic dataset proved realistic, boasting elevated levels of novelty and diversity.
AB - OBJECTIVE: We generated synthetic data starting from a population of two hundred thirty-eight adults SARS-CoV-2 positive patients admitted to the University Hospital of Brussels, Belgium, in 2020, utilizing a Conditional Tabular Generative Adversarial Network (CTGAN)-based technique with the aim of testing the performance, representativeness, realism, novelty, and diversity of synthetic data generated from a small patient sample. A Multidisciplinary Approach (TIMA) incorporates active participation from a medical team throughout the various stages of this process. METHODS: The TIMA committee scrutinized data for inconsistencies, implementing stringent rules for variables unlearned by the system. A sensitivity analysis determined 100,000 epochs, leading to the generation of 10,000 synthetic data. The model's performance was tested using a general-purpose dataset, comparing real and synthetic data. RESULTS: Outcomes indicate the robustness of our model, with an average contingency score of 0.94 across variable pairs in synthetic and real data. Continuous variables exhibited a median correlation similarity score of 0.97. Novelty received a top score of 1. Principal Component Analysis (PCA) on synthetic values demonstrated diversity, as no patient pair displayed a zero or close-to-zero value distance. Remarkably, the TIMA committee's evaluation revealed that synthetic data was recognized as authentic by nearly 100%. CONCLUSIONS: Our trained model exhibited commendable performance, yielding high representativeness in the synthetic dataset compared to the original. The synthetic dataset proved realistic, boasting elevated levels of novelty and diversity.
KW - COVID-19
KW - Generative artificial intelligence
KW - Pandemic
KW - SARS-CoV-2
KW - Synthetic structured data
U2 - 10.1016/j.cmpb.2024.108523
DO - 10.1016/j.cmpb.2024.108523
M3 - Article
SN - 0169-2607
VL - 259
SP - 108523
JO - Computer Methods and Programs in Biomedicine
JF - Computer Methods and Programs in Biomedicine
M1 - 108523
ER -