Uncertainty-Aware Audio-Visual Segmentation With Dynamic Fusion for Multimodal Alignment

Lv, Ying
Liu, Zhi
Chang, Xiaojun
Department
Computer Vision
Type
Journal article
Language
English
Abstract
Audio-visual segmentation (AVS) aims to achieve precise object segmentation by leveraging multimodal cues. However, effective alignment and fusion of audio and visual features are often hindered by inherent uncertainty within multimodal data, such as data quality inconsistencies, semantic mismatches, and temporal or spatial misalignments. To address these challenges, we propose an Uncertainty-aware Audio-Visual Segmentation (UAVS) method that dynamically handles uncertainty to improve segmentation accuracy and robustness. Our method employs CLIP-generated text embeddings to provide semantic cues of categories for audio features, reducing ambiguity in multimodal alignment. We then introduce a Mixture of Experts (MoE) model that maps multimodal embedding samples to multi-dimensional Gaussian distributions, quantifying uncertainty through variance and modeling feature confidence with the Gaussian probability density function, which effectively captures noise and semantic discrepancies across modalities. In addition, we design a dynamic path algorithm based on uncertainty, enabling the model to adaptively route samples to experts with high confidence. This algorithm enhances performance in complex, noisy, and ambiguous scenes. Extensive experiments conducted on three subsets of the AVSBench benchmark dataset demonstrate that our proposed method achieves competitive performance.
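The confidence-based routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each expert is summarized by a diagonal Gaussian (a mean and a per-dimension variance), scores a sample by its Gaussian log-density under each expert, and turns those scores into routing weights with a softmax. The names `gaussian_log_density`, `route`, and the toy `experts` list are hypothetical and exist only for illustration.

```python
import math

def gaussian_log_density(x, mean, var):
    """Log-density of x under a diagonal Gaussian N(mean, diag(var)).

    Assumption: larger variance means higher uncertainty, and the density
    value serves as a confidence score for the sample.
    """
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def route(x, experts):
    """Return softmax routing weights over experts from their log-densities."""
    scores = [gaussian_log_density(x, e["mean"], e["var"]) for e in experts]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [v / total for v in exps]

# Toy experts (hypothetical values): two 2-D Gaussians with unit variance.
experts = [
    {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    {"mean": [3.0, 3.0], "var": [1.0, 1.0]},
]

weights = route([0.1, -0.2], experts)
# Index of the expert with the highest confidence for this sample.
best = max(range(len(weights)), key=weights.__getitem__)
```

In this toy setup the sample lies close to the first expert's mean, so that expert receives most of the routing weight. A trained model would predict the means and variances from the multimodal embeddings rather than fixing them by hand.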
Citation
Y. Lv, Z. Liu, X. Chang, "Uncertainty-Aware Audio-Visual Segmentation With Dynamic Fusion for Multimodal Alignment," IEEE Transactions on Multimedia, vol. PP, no. 99, pp. 1-14, 2026, https://doi.org/10.1109/tmm.2026.3651123.
Source
IEEE Transactions on Multimedia
Keywords
40 Engineering, 46 Information and Computing Sciences, 4603 Computer Vision and Multimedia Computation
Publisher
IEEE