Temporal conditional Wasserstein GANs for audio-visual affect-related ties

Christos Athanasiadis; Enrique Hortal; Stelios Asteriadis

doi:10.1109/ACIIW52867.2021.9666277

Christos Athanasiadis^*, Enrique Hortal, Stelios Asteriadis

^*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

Emotion recognition through audio is a challenging task that entails proper feature extraction and classification. State-of-the-art classification strategies are usually based on deep learning architectures, and training complex deep learning networks normally requires very large audiovisual corpora with emotion annotations. Such corpora are not always available, since harvesting and annotating them is time-consuming. In this work, temporal conditional Wasserstein Generative Adversarial Networks (tc-wGANs) are introduced to generate robust audio data by leveraging information from the face modality. Taking as input temporal facial features extracted with a dynamic deep learning architecture (based on 3D CNN, LSTM and Transformer networks) and, additionally, conditional information related to the annotations, our system generates realistic spectrograms that represent audio clips corresponding to a specific emotional context. As proof of their validity, apart from three quality metrics (Fréchet Inception Distance, Inception Score and Structural Similarity Index), we verified the generated samples by applying an audio-based emotion recognition scheme. When the generated samples are fused with the initial real ones, an improvement of between 3.5% and 5.5% was achieved in audio emotion recognition performance on two state-of-the-art datasets.
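As a rough illustration of the conditioning idea in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It shows a generator input built from facial features, an emotion label and noise, followed by the standard Wasserstein GAN losses. All names, dimensions and the single-layer "networks" are hypothetical stand-ins, and the Lipschitz constraint on the critic (weight clipping or a gradient penalty) is omitted because the abstract does not say how it is enforced.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
BATCH, FACE_DIM, N_EMOTIONS, NOISE_DIM = 4, 16, 6, 8
SPEC_SHAPE = (32, 32)  # toy "spectrogram" height x width
SPEC_SIZE = SPEC_SHAPE[0] * SPEC_SHAPE[1]


def one_hot(labels, n_classes):
    """Encode integer emotion labels as one-hot rows."""
    return np.eye(n_classes)[labels]


def generator_input(face_feats, labels, noise):
    """Concatenate facial features, emotion condition and noise per sample."""
    return np.concatenate([face_feats, one_hot(labels, N_EMOTIONS), noise], axis=1)


def toy_generator(z, weights):
    """Hypothetical generator: one linear layer plus tanh, reshaped to a spectrogram."""
    return np.tanh(z @ weights).reshape(-1, *SPEC_SHAPE)


def toy_critic(spec, labels, weights):
    """Hypothetical critic: an unbounded linear score of (spectrogram, condition)."""
    flat = spec.reshape(len(spec), -1)
    return np.concatenate([flat, one_hot(labels, N_EMOTIONS)], axis=1) @ weights


def wasserstein_losses(real_scores, fake_scores):
    """Standard WGAN losses: the critic widens the real-fake score gap, the generator narrows it."""
    critic_loss = fake_scores.mean() - real_scores.mean()
    generator_loss = -fake_scores.mean()
    return critic_loss, generator_loss


# Random placeholders standing in for real data.
face_feats = rng.normal(size=(BATCH, FACE_DIM))          # temporal facial embedding
labels = rng.integers(0, N_EMOTIONS, size=BATCH)         # emotion annotations
noise = rng.normal(size=(BATCH, NOISE_DIM))
real_specs = rng.uniform(-1, 1, size=(BATCH, *SPEC_SHAPE))

g_w = rng.normal(scale=0.1, size=(FACE_DIM + N_EMOTIONS + NOISE_DIM, SPEC_SIZE))
c_w = rng.normal(scale=0.1, size=(SPEC_SIZE + N_EMOTIONS,))

z = generator_input(face_feats, labels, noise)
fake_specs = toy_generator(z, g_w)
critic_loss, generator_loss = wasserstein_losses(
    toy_critic(real_specs, labels, c_w),
    toy_critic(fake_specs, labels, c_w),
)
```

In an actual training loop the two losses would be minimised alternately with respect to the critic and generator parameters; here they are only evaluated once to make the data flow visible.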

Original language	English
Title of host publication	2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)
Pages	1-8
Number of pages	8
DOIs	https://doi.org/10.1109/ACIIW52867.2021.9666277
Publication status	Published - 2021
Event	2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, Nara, Japan. Duration: 28 Sept 2021 → 1 Oct 2021. Conference number: 29

Conference

Conference	2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos
Abbreviated title	ACIIW 2021
Country/Territory	Japan
City	Nara
Period	28/09/21 → 1/10/21

Keywords

Domain Adaptation
Audio Emotion Recognition
Generative Adversarial Networks
Attention Mechanisms

Access to Document

10.1109/ACIIW52867.2021.9666277

Full text: final published version, 1.81 MB. Licence: Taverne

Cite this

@inproceedings{e2abfac10ad64820bb1c87a6bc1873de,

title = "Temporal conditional Wasserstein GANs for audio-visual affect-related ties",

keywords = "Domain Adaptation, Audio Emotion Recognition, Generative Adversarial Networks, Attention Mechanisms",

author = "Christos Athanasiadis and Enrique Hortal and Stelios Asteriadis",

year = "2021",

doi = "10.1109/ACIIW52867.2021.9666277",

language = "English",

pages = "1--8",

booktitle = "2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)",

note = "2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, ACIIW 2021 ; Conference date: 28-09-2021 Through 01-10-2021",

}

Athanasiadis, C, Hortal, E & Asteriadis, S 2021, Temporal conditional Wasserstein GANs for audio-visual affect-related ties. in 2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). pp. 1-8, 2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos, Nara, Japan, 28/09/21. https://doi.org/10.1109/ACIIW52867.2021.9666277

Temporal conditional Wasserstein GANs for audio-visual affect-related ties. / Athanasiadis, Christos; Hortal, Enrique ; Asteriadis, Stelios.
2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW). 2021. p. 1-8.

TY - GEN

T1 - Temporal conditional Wasserstein GANs for audio-visual affect-related ties

AU - Athanasiadis, Christos

AU - Hortal, Enrique

AU - Asteriadis, Stelios

N1 - Conference code: 29

PY - 2021

Y1 - 2021

KW - Domain Adaptation

KW - Audio Emotion Recognition

KW - Generative Adversarial Networks

KW - Attention Mechanisms

U2 - 10.1109/ACIIW52867.2021.9666277

DO - 10.1109/ACIIW52867.2021.9666277

M3 - Conference article in proceeding

SP - 1

EP - 8

BT - 2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)

T2 - 2021 9th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos

Y2 - 28 September 2021 through 1 October 2021

ER -