Audio–visual domain adaptation using conditional semi-supervised Generative Adversarial Networks

Christos Athanasiadis*, Enrique Hortal, Stelios Asteriadis

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

10 Citations (Web of Science)

Abstract

Accessing large, manually annotated audio databases to build robust models for emotion recognition is notably difficult, hindered by annotation cost and label ambiguity. In contrast, many publicly available emotion recognition datasets are based on facial expressivity, owing to the prevailing role of computer vision in deep learning research. In the current work, we therefore performed a study on cross-modal knowledge transfer between the audio and facial modalities in the context of emotion. More concretely, we investigated whether facial information from videos could be used to improve the awareness and predictive tracking of emotions in audio signals. Our approach was based on a simple hypothesis: that the emotional content of a person's speech correlates with the corresponding facial expressions. Research in cognitive psychology supports this hypothesis and suggests that humans fuse emotion-related visual information with the auditory signal, in a cross-modal integration schema, to better understand emotions. In this regard, a method called dacssGAN (Domain Adaptation Conditional Semi-Supervised Generative Adversarial Networks) is introduced in this work to bridge these two inherently different domains. Given as input the source domain (visual data) and conditional information based on inductive conformal prediction, the proposed architecture generates data distributions that are as close as possible to the target domain (audio data). Experiments on two publicly available audio-visual emotion datasets show that classification performance with an expanded dataset, in which real audio is augmented with samples generated by dacssGAN (50.29% and 48.65%), exceeds that obtained using real audio samples alone (49.34% and 46.90%). © 2019 The Authors. Published by Elsevier B.V.

Original language: English
Pages (from-to): 331-344
Number of pages: 14
Journal: Neurocomputing
Volume: 397
Early online date: 29 Nov 2019
Publication status: Published - 15 Jul 2020

Keywords

  • Domain adaptation
  • Conformal prediction
  • Generative adversarial networks
  • Facial expression recognition
