Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition

Esam Ghaleb; Mirela Popa; Stelios Asteriadis

doi:10.1109/ACII.2019.8925444

Esam Ghaleb*, Mirela Popa, Stelios Asteriadis

*Corresponding author for this work

Advanced Computing Sciences

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review

Abstract

In Audio-Video Emotion Recognition (AVER), the idea is to have a human-level understanding of emotions from video clips. There is a need to bring these two modalities into a unified framework, to effectively learn multimodal fusion for AVER. In addition, literature studies lack in-depth analysis and utilization of how emotions vary as a function of time. Psychological and neurological studies show that negative and positive emotions are not recognized at the same speed. In this paper, we propose a novel multimodal temporal deep network framework that embeds video clips using their audio-visual content, onto a metric space, where their gap is reduced and their complementary and supplementary information is explored. We address two research questions, (1) how audio-visual cues contribute to emotion recognition and (2) how temporal information impacts the recognition rate and speed of emotions. The proposed method is evaluated on two datasets, CREMA-D and RAVDESS. The study findings are promising, achieving the state-of-the-art performance on both datasets, and showing a significant impact of multimodal and temporal emotion perception.

Original language	English
Title of host publication	8th International Conference on Affective Computing & Intelligent Interaction (ACII 2019), Cambridge, United Kingdom
Pages	552-558
Number of pages	7
ISBN (Electronic)	9781728138886
DOIs	https://doi.org/10.1109/ACII.2019.8925444
Publication status	Published - Sept 2019

Keywords

audio-video emotion recognition
deep metric learning
multimodal and incremental learning

@inproceedings{bc4e1d6ee48843d0a18cbeaf8707bfcc,

title = "Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition",

abstract = "In Audio-Video Emotion Recognition (AVER), the idea is to have a human-level understanding of emotions from video clips. There is a need to bring these two modalities into a unified framework, to effectively learn multimodal fusion for AVER. In addition, literature studies lack in-depth analysis and utilization of how emotions vary as a function of time. Psychological and neurological studies show that negative and positive emotions are not recognized at the same speed. In this paper, we propose a novel multimodal temporal deep network framework that embeds video clips using their audio-visual content, onto a metric space, where their gap is reduced and their complementary and supplementary information is explored. We address two research questions, (1) how audio-visual cues contribute to emotion recognition and (2) how temporal information impacts the recognition rate and speed of emotions. The proposed method is evaluated on two datasets, CREMA-D and RAVDESS. The study findings are promising, achieving the state-of-the-art performance on both datasets, and showing a significant impact of multimodal and temporal emotion perception.",

keywords = "audio-video emotion recognition, deep metric learning, multimodal and incremental learning",

author = "Esam Ghaleb and Mirela Popa and Stelios Asteriadis",

note = "Funding Information: This work has been funded through H2020-MaTHiSiS project under Grant Agreement No. 687772. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Nvidia TITAN XP GPUs used for this research. Publisher Copyright: {\textcopyright} 2019 IEEE.",

year = "2019",

month = sep,

doi = "10.1109/ACII.2019.8925444",

language = "English",

pages = "552--558",

booktitle = "8th International Conference on Affective Computing \& Intelligent Interaction (ACII 2019), Cambridge, United Kingdom",

}

Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition. / Ghaleb, Esam; Popa, Mirela; Asteriadis, Stelios.
8th International Conference on Affective Computing & Intelligent Interaction (ACII 2019), Cambridge, United Kingdom. 2019. p. 552-558.

TY - GEN

T1 - Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition

AU - Ghaleb, Esam

AU - Popa, Mirela

AU - Asteriadis, Stelios

N1 - Funding Information: This work has been funded through H2020-MaTHiSiS project under Grant Agreement No. 687772. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Nvidia TITAN XP GPUs used for this research. Publisher Copyright: © 2019 IEEE.

PY - 2019/9

Y1 - 2019/9

N2 - In Audio-Video Emotion Recognition (AVER), the idea is to have a human-level understanding of emotions from video clips. There is a need to bring these two modalities into a unified framework, to effectively learn multimodal fusion for AVER. In addition, literature studies lack in-depth analysis and utilization of how emotions vary as a function of time. Psychological and neurological studies show that negative and positive emotions are not recognized at the same speed. In this paper, we propose a novel multimodal temporal deep network framework that embeds video clips using their audio-visual content, onto a metric space, where their gap is reduced and their complementary and supplementary information is explored. We address two research questions, (1) how audio-visual cues contribute to emotion recognition and (2) how temporal information impacts the recognition rate and speed of emotions. The proposed method is evaluated on two datasets, CREMA-D and RAVDESS. The study findings are promising, achieving the state-of-the-art performance on both datasets, and showing a significant impact of multimodal and temporal emotion perception.

KW - audio-video emotion recognition

KW - deep metric learning

KW - multimodal and incremental learning

U2 - 10.1109/ACII.2019.8925444

DO - 10.1109/ACII.2019.8925444

M3 - Conference article in proceeding

SP - 552

EP - 558

BT - 8th International Conference on Affective Computing & Intelligent Interaction (ACII 2019), Cambridge, United Kingdom

ER -