Generating Synthetic Clinical Speech Data Through Simulated ASR Deletion Error

Hali Lindsay; Johannes Tröger; Mario Mina; Nicklas Linz; Philipp Müller; Jan Alexandersson; Inez Ramakers

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

Training classification models on clinical speech is a time-saving and effective solution for many healthcare challenges, such as screening for Alzheimer’s Disease over the phone. One of the primary limiting factors of the success of artificial intelligence (AI) solutions is the amount of relevant data available. Clinical data is expensive to collect, not sufficient for large-scale machine learning or neural methods, and often not shareable between institutions due to data protection laws. With the increasing demand for AI in health systems, generating synthetic clinical data that maintains the nuance of underlying patient pathology is the next pressing task. Previous work has shown that automated evaluation of clinical speech tasks via automatic speech recognition (ASR) is comparable to manually annotated results in diagnostic scenarios even though ASR systems produce errors during the transcription process. In this work, we propose to generate synthetic clinical data by simulating ASR deletion errors on the transcript to produce additional data. We compare the synthetic data to the real data with traditional machine learning methods to test the feasibility of the proposed method. Using a dataset of 50 cognitively impaired and 50 control Dutch speakers, ten additional data points are synthetically generated for each subject, increasing the training size from 100 to 1000 training points. We find consistent and comparable performance of models trained on only synthetic data (AUC=0.77) to real data (AUC=0.77) in a variety of traditional machine learning scenarios. Additionally, linear models are not able to distinguish between real and synthetic data.
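The augmentation step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state how deletions are sampled, so the independent per-token deletion model, the `deletion_rate` value, the function names `simulate_deletions` and `augment`, and the toy Dutch word lists are all assumptions made for illustration. Only the figure of ten synthetic variants per subject comes from the abstract.

```python
import random


def simulate_deletions(tokens, deletion_rate, rng):
    """Return a copy of `tokens` in which each token is dropped independently
    with probability `deletion_rate` (an assumed, uniform error model)."""
    if not tokens:
        return []
    kept = [tok for tok in tokens if rng.random() >= deletion_rate]
    # Keep at least one token so later feature extraction never sees an
    # empty transcript (a design choice of this sketch, not of the paper).
    return kept if kept else [rng.choice(tokens)]


def augment(transcripts, n_variants=10, deletion_rate=0.1, seed=0):
    """Create `n_variants` synthetic transcripts per subject.

    `transcripts` is a list of (subject_id, label, tokens) tuples; each
    synthetic item keeps the subject id and diagnostic label of its source.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    synthetic = []
    for subject_id, label, tokens in transcripts:
        for _ in range(n_variants):
            variant = simulate_deletions(tokens, deletion_rate, rng)
            synthetic.append((subject_id, label, variant))
    return synthetic


# Toy input: 100 subjects with made-up word lists, not real clinical data.
toy_words = ["hond", "kat", "paard", "koe", "schaap", "geit", "kip", "eend"]
real = [(i, i % 2, list(toy_words)) for i in range(100)]
synthetic = augment(real, n_variants=10, deletion_rate=0.1)
```

With 100 subjects and ten variants each, `synthetic` holds 1000 transcripts, matching the training-set size quoted in the abstract. Because variants of one subject share most of their tokens, any evaluation built on such data would need to keep all variants of a subject on the same side of a train/test split.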

Original language	English
Title of host publication	Proceedings - 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, as part of the 13th Edition of the Language Resources and Evaluation Conference, LREC 2022
Editors	Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser
Publisher	European Language Resources Association (ELRA)
Pages	9-16
Number of pages	8
ISBN (Electronic)	9791095546771
Publication status	Published - 1 Jan 2022
Event	4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments - Marseille, France Duration: 25 Jun 2022 → 25 Jun 2022 Conference number: 4

Workshop

Workshop	4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments
Abbreviated title	RAPID 2022
Country/Territory	France
City	Marseille
Period	25/06/22 → 25/06/22

Keywords

Automatic Speech Recognition
Clinical Speech
Data Augmentation
Machine Learning
Mild Cognitive Impairment
Synthetic Data

Cite this

Lindsay, H., Tröger, J., Mina, M., Linz, N., Müller, P., Alexandersson, J., & Ramakers, I. (2022). Generating Synthetic Clinical Speech Data Through Simulated ASR Deletion Error. In D. Kokkinakis, C. K. Themistocleous, K. L. Fors, A. Tsanas, & K. C. Fraser (Eds.), Proceedings - 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, as part of the 13th Edition of the Language Resources and Evaluation Conference, LREC 2022 (pp. 9-16). European Language Resources Association (ELRA).

@inproceedings{94d2527c7a8d48ca903f98c0fd0d11d1,

title = "Generating Synthetic Clinical Speech Data Through Simulated ASR Deletion Error",

abstract = "Training classification models on clinical speech is a time-saving and effective solution for many healthcare challenges, such as screening for Alzheimer{\textquoteright}s Disease over the phone. One of the primary limiting factors of the success of artificial intelligence (AI) solutions is the amount of relevant data available. Clinical data is expensive to collect, not sufficient for large-scale machine learning or neural methods, and often not shareable between institutions due to data protection laws. With the increasing demand for AI in health systems, generating synthetic clinical data that maintains the nuance of underlying patient pathology is the next pressing task. Previous work has shown that automated evaluation of clinical speech tasks via automatic speech recognition (ASR) is comparable to manually annotated results in diagnostic scenarios even though ASR systems produce errors during the transcription process. In this work, we propose to generate synthetic clinical data by simulating ASR deletion errors on the transcript to produce additional data. We compare the synthetic data to the real data with traditional machine learning methods to test the feasibility of the proposed method. Using a dataset of 50 cognitively impaired and 50 control Dutch speakers, ten additional data points are synthetically generated for each subject, increasing the training size from 100 to 1000 training points. We find consistent and comparable performance of models trained on only synthetic data (AUC=0.77) to real data (AUC=0.77) in a variety of traditional machine learning scenarios. Additionally, linear models are not able to distinguish between real and synthetic data.",

keywords = "Automatic Speech Recognition, Clinical Speech, Data Augmentation, Machine Learning, Mild Cognitive Impairment, Synthetic Data",

author = "Hali Lindsay and Johannes Tr{\"o}ger and Mario Mina and Nicklas Linz and Philipp M{\"u}ller and Jan Alexandersson and Inez Ramakers",

note = "Funding Information: This research was funded by MEPHESTO project Q10 (BMBF Grant Number 01IS20075). Funding Information: Pakhomov, S. and Hemmy, L. (2014). A computational linguistic measure of clustering behavior on semantic verbal fluency task predicts risk of future dementia in the nun study. Cortex, 55(1):97–106, June. Funding Information: The work on this study was supported in part by the National Institutes of Health National Library of Medicine Grant [LM00962301 – S.P.] and the Nun Study data collection was supported by a grant from the National Institute of Aging (R01AG09862). The authors also wish to thank Heather Hoecker for helping with digitization of the SVF samples. Publisher Copyright: {\textcopyright} European Language Resources Association (ELRA); 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, RAPID 2022 ; Conference date: 25-06-2022 Through 25-06-2022",

year = "2022",

month = jan,

day = "1",

language = "English",

pages = "9--16",

editor = "Dimitrios Kokkinakis and Themistocleous, {Charalambos K.} and Fors, {Kristina Lundholm} and Athanasios Tsanas and Fraser, {Kathleen C.}",

booktitle = "Proceedings - 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, as part of the 13th Edition of the Language Resources and Evaluation Conference, LREC 2022",

publisher = "European Language Resources Association (ELRA)",

address = "Luxembourg",

}

TY - GEN

T1 - Generating Synthetic Clinical Speech Data Through Simulated ASR Deletion Error

AU - Lindsay, Hali

AU - Tröger, Johannes

AU - Mina, Mario

AU - Linz, Nicklas

AU - Müller, Philipp

AU - Alexandersson, Jan

AU - Ramakers, Inez

N1 - Conference code: 4

PY - 2022/1/1

Y1 - 2022/1/1

AB - Training classification models on clinical speech is a time-saving and effective solution for many healthcare challenges, such as screening for Alzheimer’s Disease over the phone. One of the primary limiting factors of the success of artificial intelligence (AI) solutions is the amount of relevant data available. Clinical data is expensive to collect, not sufficient for large-scale machine learning or neural methods, and often not shareable between institutions due to data protection laws. With the increasing demand for AI in health systems, generating synthetic clinical data that maintains the nuance of underlying patient pathology is the next pressing task. Previous work has shown that automated evaluation of clinical speech tasks via automatic speech recognition (ASR) is comparable to manually annotated results in diagnostic scenarios even though ASR systems produce errors during the transcription process. In this work, we propose to generate synthetic clinical data by simulating ASR deletion errors on the transcript to produce additional data. We compare the synthetic data to the real data with traditional machine learning methods to test the feasibility of the proposed method. Using a dataset of 50 cognitively impaired and 50 control Dutch speakers, ten additional data points are synthetically generated for each subject, increasing the training size from 100 to 1000 training points. We find consistent and comparable performance of models trained on only synthetic data (AUC=0.77) to real data (AUC=0.77) in a variety of traditional machine learning scenarios. Additionally, linear models are not able to distinguish between real and synthetic data.

KW - Automatic Speech Recognition

KW - Clinical Speech

KW - Data Augmentation

KW - Machine Learning

KW - Mild Cognitive Impairment

KW - Synthetic Data

UR - http://www.scopus.com/inward/record.url?scp=85145879191&partnerID=8YFLogxK

M3 - Conference article in proceeding

SP - 9

EP - 16

BT - Proceedings - 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, as part of the 13th Edition of the Language Resources and Evaluation Conference, LREC 2022

A2 - Kokkinakis, Dimitrios

A2 - Themistocleous, Charalambos K.

A2 - Fors, Kristina Lundholm

A2 - Tsanas, Athanasios

A2 - Fraser, Kathleen C.

PB - European Language Resources Association (ELRA)

T2 - 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments

Y2 - 25 June 2022 through 25 June 2022

ER -
