Deep Generative Models for Synthetic Data: A Survey

Peter Eigenschink; Thomas Reutterer; Stefan Vamosi; Ralf Vamosi; Chang Sun; Klaudius Kalcher

doi:10.1109/ACCESS.2023.3275134

Corresponding author: Thomas Reutterer

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. This is why models for synthesizing text data for natural language processing cannot readily be compared to models for synthesizing health records anymore. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data, based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic data-set relative to the original data-set regardless of the models' internal structures. The criteria reflect requirements different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data in natural language, speech and audio processing tasks. At the same time, novelty and representativeness are more important for healthcare and mobility data. We also find that measurement of representativeness is often accomplished using statistical metrics, realism by using human judgement, and novelty using privacy tests.

Original language	English
Pages (from-to)	47304-47320
Number of pages	17
Journal	IEEE Access
Volume	11
DOIs	https://doi.org/10.1109/ACCESS.2023.3275134
Publication status	Published - 2023

Keywords

Data models
Synthetic data
Measurement
Biological system modeling
Analytical models
Training data
Medical services
Artificial intelligence
big data
deep learning
generative models
neural networks
synthetic data
privacy
NATURAL-LANGUAGE GENERATION
PREDICTION

Access to Document

10.1109/ACCESS.2023.3275134

Licence: CC BY-NC-ND

Cite this

@article{02f1de1b695d4fe5b1f82e661b74bfaf,

title = "Deep Generative Models for Synthetic Data: A Survey",

abstract = "A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. This is why models for synthesizing text data for natural language processing cannot readily be compared to models for synthesizing health records anymore. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data, based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic data-set relative to the original data-set regardless of the models' internal structures. The criteria reflect requirements different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data natural language, speech and audio processing tasks. At the same time, novelty and representativeness are more important for healthcare and mobility data. We also find that measurement of representativeness is often accomplished using statistical metrics, realism by using human judgement, and novelty using privacy tests.",

keywords = "Data models, Synthetic data, Measurement, Biological system modeling, Analytical models, Training data, Medical services, Artificial intelligence, big data, deep learning, generative models, neural networks, synthetic data, privacy, NATURAL-LANGUAGE GENERATION, PREDICTION",

author = "Peter Eigenschink and Thomas Reutterer and Stefan Vamosi and Ralf Vamosi and Chang Sun and Klaudius Kalcher",

year = "2023",

doi = "10.1109/ACCESS.2023.3275134",

language = "English",

volume = "11",

pages = "47304--47320",

journal = "IEEE Access",

issn = "2169-3536",

publisher = "IEEE",

}

TY - JOUR

T1 - Deep Generative Models for Synthetic Data: A Survey

AU - Eigenschink, Peter

AU - Reutterer, Thomas

AU - Vamosi, Stefan

AU - Vamosi, Ralf

AU - Sun, Chang

AU - Kalcher, Klaudius

PY - 2023

Y1 - 2023

AB - A growing interest in synthetic data has stimulated the development and advancement of a large variety of deep generative models for a wide range of applications. However, as this research has progressed, its streams have become more specialized and disconnected from one another. This is why models for synthesizing text data for natural language processing cannot readily be compared to models for synthesizing health records anymore. To mitigate this isolation, we propose a data-driven evaluation framework for generative models for synthetic sequential data, an important and challenging sub-category of synthetic data, based on five high-level criteria: representativeness, novelty, realism, diversity and coherence of a synthetic data-set relative to the original data-set regardless of the models' internal structures. The criteria reflect requirements different domains impose on synthetic data and allow model users to assess the quality of synthetic data across models. In a critical review of generative models for sequential data, we examine and compare the importance of each performance criterion in numerous domains. We find that realism and coherence are more important for synthetic data natural language, speech and audio processing tasks. At the same time, novelty and representativeness are more important for healthcare and mobility data. We also find that measurement of representativeness is often accomplished using statistical metrics, realism by using human judgement, and novelty using privacy tests.

KW - Data models

KW - Synthetic data

KW - Measurement

KW - Biological system modeling

KW - Analytical models

KW - Training data

KW - Medical services

KW - Artificial intelligence

KW - big data

KW - deep learning

KW - generative models

KW - neural networks

KW - synthetic data

KW - privacy

KW - NATURAL-LANGUAGE GENERATION

KW - PREDICTION

U2 - 10.1109/ACCESS.2023.3275134

DO - 10.1109/ACCESS.2023.3275134

M3 - Article

SN - 2169-3536

VL - 11

SP - 47304

EP - 47320

JO - IEEE Access

JF - IEEE Access

ER -