TY - JOUR
T1 - Audio–visual domain adaptation using conditional semi-supervised Generative Adversarial Networks
AU - Athanasiadis, Christos
AU - Hortal, Enrique
AU - Asteriadis, Stelios
N1 - Funding Information:
This work was supported by the Horizon 2020 funded project MaTHiSiS (Managing Affective-learning THrough Intelligent atoms and Smart InteractionS), nr. 687772 (http://www.mathisis-project.eu/). Furthermore, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Nvidia GeForce GTX TITAN X GPU used throughout the experimental phase. Christos Athanasiadis received his master's degree in Artificial Intelligence and Digital Media Processing from the Aristotle University of Thessaloniki in 2012. From 2013 to 2015, he worked as a research assistant at the Centre for Research and Technology Hellas (CERTH). He is currently a Ph.D. candidate at the Department of Data Science and Knowledge Engineering, University of Maastricht, the Netherlands. His research interests include data mining, transfer learning, affective computing, and computational intelligence for video and serious games. Enrique Hortal is an assistant professor at the Department of Data Science and Knowledge Engineering at the University of Maastricht, the Netherlands. He received his Ph.D. in Industrial and Telecommunications Technologies from the Miguel Hernández University of Elche, Spain, in 2016. He has a background in bio-signal processing and classification. Enrique has participated in national and international research projects in the field of brain-machine interfaces and in adaptable e-learning based on affective recognition. His research interests include (bio)signal analysis, human-machine interaction, machine intelligence, and cognitive state recognition. Stylianos Asteriadis is an assistant professor at the Department of Data Science and Knowledge Engineering, University of Maastricht, the Netherlands, and a member of the RAI group. His research interests include visual computing, machine intelligence, human emotion recognition, and human activity recognition, with applications in ambient-assisted living environments and smart learning contexts. He received his Ph.D. in 2011 from the National Technical University of Athens and, before joining the University of Maastricht, worked at CERTH (Centre for Research and Technology Hellas) as a postdoctoral researcher.
Publisher Copyright:
© 2019
PY - 2020/7/15
Y1 - 2020/7/15
N2 - Accessing large, manually annotated audio databases in an effort to create robust models for emotion recognition is a notably difficult task, hampered by annotation cost and label ambiguities. In contrast, publicly available datasets for emotion recognition based on facial expressivity are plentiful, owing to the prevailing role of computer vision in present-day deep learning research. In the current work, we therefore studied cross-modal knowledge transfer between the audio and facial modalities within the emotional context. More concretely, we investigated whether facial information from videos could be used to boost the recognition and prediction of emotions in audio signals. Our approach is based on a simple hypothesis: the emotional content of a person's oral expression correlates with the corresponding facial expressions. Research in cognitive psychology supports this hypothesis, suggesting that humans fuse emotion-related visual information with the auditory signal in a cross-modal integration scheme to better understand emotions. In this regard, a method called dacssGAN (Domain Adaptation Conditional Semi-Supervised Generative Adversarial Networks) is introduced in this work to bridge these two inherently different domains. Given as input the source domain (visual data) and conditional information based on inductive conformal prediction, the proposed architecture generates data distributions that are as close as possible to the target domain (audio data). Experiments on two publicly available audio-visual emotion datasets show that classification on real audio augmented with dacssGAN-generated samples (50.29% and 48.65%) outperforms classification on real audio samples alone (49.34% and 46.90%).
AB - Accessing large, manually annotated audio databases in an effort to create robust models for emotion recognition is a notably difficult task, hampered by annotation cost and label ambiguities. In contrast, publicly available datasets for emotion recognition based on facial expressivity are plentiful, owing to the prevailing role of computer vision in present-day deep learning research. In the current work, we therefore studied cross-modal knowledge transfer between the audio and facial modalities within the emotional context. More concretely, we investigated whether facial information from videos could be used to boost the recognition and prediction of emotions in audio signals. Our approach is based on a simple hypothesis: the emotional content of a person's oral expression correlates with the corresponding facial expressions. Research in cognitive psychology supports this hypothesis, suggesting that humans fuse emotion-related visual information with the auditory signal in a cross-modal integration scheme to better understand emotions. In this regard, a method called dacssGAN (Domain Adaptation Conditional Semi-Supervised Generative Adversarial Networks) is introduced in this work to bridge these two inherently different domains. Given as input the source domain (visual data) and conditional information based on inductive conformal prediction, the proposed architecture generates data distributions that are as close as possible to the target domain (audio data). Experiments on two publicly available audio-visual emotion datasets show that classification on real audio augmented with dacssGAN-generated samples (50.29% and 48.65%) outperforms classification on real audio samples alone (49.34% and 46.90%).
KW - Domain adaptation
KW - Conformal prediction
KW - Generative adversarial networks
KW - Facial expression recognition
U2 - 10.1016/j.neucom.2019.09.106
DO - 10.1016/j.neucom.2019.09.106
M3 - Article
SN - 0925-2312
VL - 397
SP - 331
EP - 344
JO - Neurocomputing
JF - Neurocomputing
ER -