Metric Learning-Based Multimodal Audio-Visual Emotion Recognition

Esam Ghaleb*, Mirela Popa, Stylianos Asteriadis

*Corresponding author for this work

Research output: Contribution to journal › Article › Academic › peer-review

15 Citations (Web of Science)


People express their emotions through multiple channels, such as visual and audio ones. Consequently, automatic emotion recognition can benefit significantly from multimodal learning. Although each modality exhibits unique characteristics, multimodal learning exploits the complementary information of diverse modalities measuring the same instance, resulting in an enhanced understanding of emotions. Yet, these dependencies and relations are not fully exploited in audio-video emotion recognition. Furthermore, learning an effective metric through multimodality is a crucial goal for many applications in machine learning. Therefore, in this article, we propose Multimodal Emotion Recognition Metric Learning (MERML), learned jointly to obtain a discriminative score and a robust representation in a latent space for both modalities. The learned metric is then used efficiently through a radial basis function (RBF)-based support vector machine (SVM) kernel. The evaluation of our framework shows significant performance, improving on the state-of-the-art results on the eNTERFACE and CREMA-D datasets.
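The abstract describes plugging a learned metric into an RBF-based SVM kernel. As a minimal sketch of this idea (not the paper's MERML training procedure, which learns the transform jointly from both modalities), the snippet below assumes a linear transform `L` defining a Mahalanobis-style distance and uses it inside a precomputed RBF kernel for an SVM; the data and `L` here are random stand-ins for fused audio-visual embeddings:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy stand-ins for audio-visual feature vectors (hypothetical data);
# MERML would instead learn the metric from real multimodal embeddings.
X_train = rng.normal(size=(40, 8))
y_train = np.array([0] * 20 + [1] * 20)  # two emotion classes
X_test = rng.normal(size=(10, 8))

# A (here random) linear transform L defining the learned metric
# d(x, y)^2 = ||L x - L y||^2; in MERML, L would be learned discriminatively.
L = rng.normal(size=(8, 8)) * 0.3

def rbf_kernel_with_metric(A, B, L, gamma=0.5):
    """RBF kernel computed in the latent space induced by the transform L."""
    A_t, B_t = A @ L.T, B @ L.T
    sq_dists = ((A_t[:, None, :] - B_t[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

# Train an SVM on the precomputed metric-aware kernel.
K_train = rbf_kernel_with_metric(X_train, X_train, L)
clf = SVC(kernel="precomputed").fit(K_train, y_train)

# Classify test samples via their kernel values against the training set.
K_test = rbf_kernel_with_metric(X_test, X_train, L)
preds = clf.predict(K_test)
```

Using `kernel="precomputed"` lets any learned distance be injected into a standard SVM without modifying the solver, which is why a learned metric composes cleanly with the RBF kernel.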
Original language: English
Pages (from-to): 37-48
Number of pages: 12
Journal: IEEE MultiMedia
Issue number: 1
Publication status: Published - 2020