Generating Synthetic Clinical Speech Data Through Simulated ASR Deletion Error

Hali Lindsay, Johannes Tröger, Mario Mina, Nicklas Linz, Philipp Müller, Jan Alexandersson, Inez Ramakers

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

Training classification models on clinical speech is a time-saving and effective solution for many healthcare challenges, such as screening for Alzheimer’s Disease over the phone. One of the primary limiting factors of the success of artificial intelligence (AI) solutions is the amount of relevant data available. Clinical data is expensive to collect, not sufficient for large-scale machine learning or neural methods, and often not shareable between institutions due to data protection laws. With the increasing demand for AI in health systems, generating synthetic clinical data that maintains the nuance of underlying patient pathology is the next pressing task. Previous work has shown that automated evaluation of clinical speech tasks via automatic speech recognition (ASR) is comparable to manually annotated results in diagnostic scenarios even though ASR systems produce errors during the transcription process. In this work, we propose to generate synthetic clinical data by simulating ASR deletion errors on the transcript to produce additional data. We compare the synthetic data to the real data with traditional machine learning methods to test the feasibility of the proposed method. Using a dataset of 50 cognitively impaired and 50 control Dutch speakers, ten additional data points are synthetically generated for each subject, increasing the training size from 100 to 1000 training points. We find consistent and comparable performance of models trained on only synthetic data (AUC=0.77) to real data (AUC=0.77) in a variety of traditional machine learning scenarios. Additionally, linear models are not able to distinguish between real and synthetic data.
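
The augmentation idea in the abstract (simulating ASR deletion errors on a transcript to obtain ten synthetic variants per subject) can be illustrated with a minimal sketch. The abstract does not say how deletions are sampled, so everything below is an assumption made for illustration: the function names `simulate_deletions` and `augment`, the independent per-token deletion, the `deletion_rate` of 0.1, and the made-up Dutch example sentence. It should not be read as the authors' implementation.

```python
import random


def simulate_deletions(transcript, deletion_rate=0.1, rng=None):
    """Drop each whitespace-separated token independently with probability
    `deletion_rate`, mimicking ASR deletion errors (hypothetical scheme)."""
    rng = rng or random.Random()
    tokens = transcript.split()
    # A token is kept when the random draw in [0, 1) is at or above the rate.
    kept = [tok for tok in tokens if rng.random() >= deletion_rate]
    return " ".join(kept)


def augment(transcript, n_copies=10, deletion_rate=0.1, seed=0):
    """Return `n_copies` synthetic variants of one transcript.
    The default of ten copies mirrors the count given in the abstract;
    the seed only makes the sketch reproducible."""
    rng = random.Random(seed)
    return [simulate_deletions(transcript, deletion_rate, rng)
            for _ in range(n_copies)]


if __name__ == "__main__":
    # Invented example sentence, not taken from the dataset.
    for variant in augment("de hond loopt in het park", deletion_rate=0.3):
        print(variant)
```

Each variant keeps the original word order and only removes tokens, so features computed from the transcript shift in the way a deletion-prone ASR system might shift them. Applying `augment` to each of the 100 subjects would yield the 1000 synthetic points mentioned above.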
Original language: English
Title of host publication: Proceedings - 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments, as part of the 13th Edition of the Language Resources and Evaluation Conference, LREC 2022
Editors: Dimitrios Kokkinakis, Charalambos K. Themistocleous, Kristina Lundholm Fors, Athanasios Tsanas, Kathleen C. Fraser
Publisher: European Language Resources Association (ELRA)
Pages: 9-16
Number of pages: 8
ISBN (Electronic): 9791095546771
Publication status: Published - 1 Jan 2022
Event: 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments - Marseille, France
Duration: 25 Jun 2022 → 25 Jun 2022
Conference number: 4

Workshop

Workshop: 4th RaPID Workshop: Resources and Processing of Linguistic, Para-Linguistic and Extra-Linguistic Data from People with Various Forms of Cognitive/Psychiatric/Developmental Impairments
Abbreviated title: RAPID 2022
Country/Territory: France
City: Marseille
Period: 25/06/22 → 25/06/22

Keywords

  • Automatic Speech Recognition
  • Clinical Speech
  • Data Augmentation
  • Machine Learning
  • Mild Cognitive Impairment
  • Synthetic Data
