TY - JOUR
T1 - Metric Learning-Based Multimodal Audio-Visual Emotion Recognition
AU - Ghaleb, Esam
AU - Popa, Mirela
AU - Asteriadis, Stylianos
N1 - Funding Information:
This article was supported by the H2020-MaTHiSiS project under Grant 687772. The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Nvidia TITAN XP GPUs used for this research.
Publisher Copyright:
© 1994-2012 IEEE.
PY - 2020
Y1 - 2020
N2 - People express their emotions through multiple channels, such as visual and audio ones. Consequently, automatic emotion recognition can benefit significantly from multimodal learning. Even though each modality exhibits unique characteristics, multimodal learning takes advantage of the complementary information of diverse modalities when measuring the same instance, resulting in an enhanced understanding of emotions. Yet, these dependencies and relations are not fully exploited in audio-visual emotion recognition. Furthermore, learning an effective metric through multimodality is a crucial goal for many applications in machine learning. Therefore, in this article, we propose multimodal emotion recognition metric learning (MERML), learned jointly to obtain a discriminative score and a robust representation in a latent space for both modalities. The learned metric is efficiently used through a radial basis function (RBF) based support vector machine (SVM) kernel. The evaluation of our framework shows significant performance gains, improving on the state-of-the-art results on the eNTERFACE and CREMA-D datasets.
AB - People express their emotions through multiple channels, such as visual and audio ones. Consequently, automatic emotion recognition can benefit significantly from multimodal learning. Even though each modality exhibits unique characteristics, multimodal learning takes advantage of the complementary information of diverse modalities when measuring the same instance, resulting in an enhanced understanding of emotions. Yet, these dependencies and relations are not fully exploited in audio-visual emotion recognition. Furthermore, learning an effective metric through multimodality is a crucial goal for many applications in machine learning. Therefore, in this article, we propose multimodal emotion recognition metric learning (MERML), learned jointly to obtain a discriminative score and a robust representation in a latent space for both modalities. The learned metric is efficiently used through a radial basis function (RBF) based support vector machine (SVM) kernel. The evaluation of our framework shows significant performance gains, improving on the state-of-the-art results on the eNTERFACE and CREMA-D datasets.
U2 - 10.1109/MMUL.2019.2960219
DO - 10.1109/MMUL.2019.2960219
M3 - Article
SN - 1070-986X
VL - 27
SP - 37
EP - 48
JO - IEEE MultiMedia
JF - IEEE MultiMedia
IS - 1
ER -