PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association

Hannan, Abdul
Manzoor, Muhammad Arslan
Nawaz, Shah
Liaqat, Muhammad Irzam
Schedl, Markus
Noman, Mubashir
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
We study the task of learning associations between faces and voices, which is gaining interest in the multimodal community. Existing methods suffer from the deliberate crafting of negative mining procedures as well as the reliance on the distant margin parameter. These issues are addressed by learning a joint embedding space in which orthogonality constraints are applied to the fused embeddings of faces and voices. However, the embedding spaces of faces and voices possess different characteristics and must be aligned before they are fused. To this end, we propose a method that accurately aligns the embedding spaces and fuses them with an enhanced gated fusion, thereby improving the performance of face-voice association. Extensive experiments on the VoxCeleb dataset reveal the merits of the proposed approach.
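Illustrative note: the abstract names gated fusion of face and voice embeddings but this record gives no architectural details. The following is a minimal sketch of a generic element-wise gated fusion, not the authors' "enhanced" variant, and it omits the alignment step entirely. The function names, the toy embeddings, and the randomly initialised gate parameters are all assumptions made for illustration.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def linear(weights, bias, v):
    """Apply y = W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    """Logistic function; adequate for the small inputs used here."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(face, voice, gate_w, gate_b):
    """Fuse two same-sized embeddings with an element-wise gate.

    The gate g = sigmoid(W [face; voice] + b) lies in (0, 1) per
    dimension, and the fused vector is g * face + (1 - g) * voice.
    """
    gate = [sigmoid(z) for z in linear(gate_w, gate_b, face + voice)]
    return [g * f + (1.0 - g) * v for g, f, v in zip(gate, face, voice)]


def make_random_gate(dim, seed=0):
    """Hypothetical stand-in for trained gate parameters."""
    rng = random.Random(seed)
    w = [[rng.uniform(-0.1, 0.1) for _ in range(2 * dim)]
         for _ in range(dim)]
    b = [0.0] * dim
    return w, b


dim = 4
face = l2_normalize([0.5, -1.0, 2.0, 0.1])   # toy face embedding (assumed)
voice = l2_normalize([1.5, 0.3, -0.7, 0.9])  # toy voice embedding (assumed)
gate_w, gate_b = make_random_gate(dim)
fused = gated_fusion(face, voice, gate_w, gate_b)
```

Because the gate is a per-dimension convex weight, each fused coordinate lies between the corresponding face and voice coordinates. In a trained system the gate parameters would be learned jointly with the rest of the network rather than drawn at random.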
Citation
A. Hannan, M. A. Manzoor, S. Nawaz, M. I. Liaqat, M. Schedl, and M. Noman, “PAEFF: Precise Alignment and Enhanced Gated Feature Fusion for Face-Voice Association,” Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 2710–2714, 2025, doi: 10.21437/INTERSPEECH.2025-268
Source
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Conference
26th Interspeech Conference 2025
Keywords
Cross-modal verification & matching, Face-voice association, Hyperbolic Space, Multimodal learning
Publisher
International Speech Communication Association