Multimodal Attention-Mechanism For Temporal Emotion Recognition

Esam Ghaleb, Jan Niehues, Stelios Asteriadis

Research output: Chapter in Book/Report/Conference proceeding › Conference article in proceeding › Academic › peer-review


Exploiting the multimodal and temporal interaction between audio-visual channels is essential for automatic audio-video emotion recognition (AVER). The strength of each modality in conveying emotion, and of each time window within a video clip, can be further exploited through a weighting scheme such as an attention mechanism to capture their complementary information. Attention is a powerful approach for sequence modeling and can be employed to fuse audio-video cues over time. We propose a novel framework consisting of bi-modal audio-visual time windows that span short video clips labeled with discrete emotions. Attention is used to weigh these time windows for multimodal learning and fusion. Experimental results on two datasets show that the proposed methodology achieves enhanced multimodal emotion recognition.
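The core idea described above, weighting per-time-window features with attention before fusing them, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names (`softmax`, `attention_fuse`) and the use of scalar relevance scores per window are illustrative assumptions.

```python
import math


def softmax(scores):
    """Convert raw relevance scores into attention weights that sum to 1."""
    m = max(scores)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(window_features, scores):
    """Fuse per-time-window feature vectors into one clip-level vector.

    window_features: list of equal-length feature vectors, one per time window
                     (e.g. concatenated audio-visual embeddings).
    scores: one raw relevance score per window (hypothetical; in practice
            these would be produced by a learned scoring network).
    Returns the attention-weighted sum of the windows and the weights used.
    """
    weights = softmax(scores)
    dim = len(window_features[0])
    fused = [0.0] * dim
    for w, vec in zip(weights, window_features):
        for i, v in enumerate(vec):
            fused[i] += w * v
    return fused, weights


# Example: three time windows with 2-d features; equal scores give
# uniform attention, so the fused vector is the plain average.
fused, weights = attention_fuse(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [0.0, 0.0, 0.0],
)
```

The clip-level vector `fused` would then feed a classifier over the discrete emotion labels; learning the scoring network end-to-end is what lets the model emphasize the most emotion-relevant windows and modality cues.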

Original language: English
Title of host publication: 2020 IEEE International Conference on Image Processing (ICIP)
Number of pages: 5
ISBN (Print): 9781728163956
Publication status: Published - 25 Oct 2020
Event: 2020 IEEE International Conference on Image Processing (ICIP) - Abu Dhabi, United Arab Emirates
Duration: 25 Oct 2020 - 28 Oct 2020

Publication series

Series: IEEE International Conference on Image Processing (ICIP)


Conference: 2020 IEEE International Conference on Image Processing (ICIP)
Abbreviated title: ICIP
Country/Territory: United Arab Emirates
City: Abu Dhabi


Keywords

  • attention
  • multimodal learning
  • audiovisual emotion recognition
