
Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices

Khan, Mustaqeem
Ahmad, Jamil
Gueaieb, Wail
De Masi, Giulia
Karray, Fakhri
El Saddik, Abdulmotaleb
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
The field of Multimodal Emotion Recognition (MER) has made considerable advancements in recent years; however, the opportunity to leverage the synergistic relationships between different modalities remains largely untapped. This paper introduces an MER approach employing a Joint Multi-Scale Multimodal Transformer (JMMT) with recursive cross-attention for naturalistic emotion recognition, capturing and enhancing inter- and intra-modal relationships across the visual and audio modalities. We compute multi-scale attention weights from cross-correlations between multi-scale joint representations of combined and individual cues to capture inter- and intra-modal dynamics. The attended outputs of the individual modalities are recursively fed back during fusion to further refine the features. By capturing synergistic characteristics across visual and audio inputs, the JMMT model offers a cost-effective solution for consumer devices. JMMT outperforms state-of-the-art (SOTA) MER methods, as evaluated on the IEMOCAP and MELD datasets.
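The paper itself is behind the full-text link, so the sketch below is not the authors' implementation. It is a minimal NumPy illustration of the recursive cross-attention fusion idea the abstract describes: each modality attends to the other, the attended features are fed back as residual refinements, and the refined streams are pooled into a joint representation. All function names, shapes, and the number of refinement steps are hypothetical choices for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value):
    # Attention weights from the cross-correlation between two modalities
    d = query.shape[-1]
    scores = softmax(query @ key_value.T / np.sqrt(d))
    return scores @ key_value

def recursive_fusion(audio, visual, steps=2):
    # Hypothetical recursive fusion: each modality's attended features are
    # fed back as a residual refinement, repeated for `steps` iterations.
    a, v = audio, visual
    for _ in range(steps):
        a_ref = cross_attention(a, v)  # audio attends to visual
        v_ref = cross_attention(v, a)  # visual attends to audio
        a, v = a + a_ref, v + v_ref    # residual refinement
    # Joint representation: concatenate the time-pooled refined streams
    return np.concatenate([a.mean(axis=0), v.mean(axis=0)])

rng = np.random.default_rng(0)
audio = rng.standard_normal((20, 64))   # e.g. spectrogram frames
visual = rng.standard_normal((16, 64))  # e.g. face-crop frames
joint = recursive_fusion(audio, visual)
print(joint.shape)  # (128,)
```

In the actual JMMT these operations would be multi-head transformer layers computed at multiple temporal scales; the sketch only shows the cross-attention and recursive-feedback structure.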
Citation
M. Khan, J. Ahmad, W. Gueaieb, G. De Masi, F. Karray and A. El Saddik, "Joint Multi-Scale Multimodal Transformer for Emotion Using Consumer Devices," in IEEE Transactions on Consumer Electronics, doi: 10.1109/TCE.2025.3532322.
Source
IEEE Transactions on Consumer Electronics
Keywords
Emotion recognition, Visualization, Transformers, Feature extraction, Face recognition, Residual neural networks, Data mining, Consumer electronics, Spectrogram, Vehicle dynamics
Publisher
IEEE