Consistency-Queried Transformer for Audio-Visual Segmentation
Lv, Ying ; Liu, Zhi ; Chang, Xiaojun
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Audio-visual segmentation (AVS) aims to segment objects in audio-visual content, and effective interaction between audio and visual features has attracted considerable attention in the multimodal community. Despite significant progress, most existing AVS methods are hampered by multimodal inconsistencies. These inconsistencies primarily manifest as a mismatch between the audio cues and the visual information they guide, wherein visual features often dominate the audio modality. To address this issue, we propose the Consistency-Queried Transformer (CQFormer), a novel transformer-based framework for AVS tasks. The framework features a Consistency Query Generator (CQG) and a Query-Aligned Matching (QAM) module. A Noise Contrastive Estimation (NCE) loss enhances modality matching and consistency by minimizing the distributional differences between audio and visual features, which facilitates effective fusion and interaction between them. In addition, introducing the consistency query during the decoding stage strengthens consistency constraints and object-level semantic information, further improving the accuracy and stability of audio-visual segmentation. Extensive experiments on the popular audio-visual segmentation benchmark dataset demonstrate that the proposed CQFormer achieves state-of-the-art performance.
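The abstract does not give the exact form of the NCE loss, so the following is only a minimal sketch of a generic InfoNCE-style contrastive loss between paired audio and visual embeddings, not the authors' implementation. It assumes that row i of each input belongs to the same clip (the positive pair) and that all other rows in the batch act as negatives. The function names `l2_normalize` and `info_nce` and the `temperature` value are hypothetical choices made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def info_nce(audio, visual, temperature=0.07):
    """Generic InfoNCE loss (illustrative, not the paper's exact loss).

    audio, visual: equal-length lists of embedding vectors; audio[i] and
    visual[i] are assumed to be a matching (positive) pair, and every
    other visual[j] serves as a negative for audio[i].
    """
    a = [l2_normalize(row) for row in audio]
    v = [l2_normalize(row) for row in visual]
    total = 0.0
    for i, a_i in enumerate(a):
        # Cosine similarity of audio i to every visual embedding, scaled.
        logits = [sum(p * q for p, q in zip(a_i, v_j)) / temperature for v_j in v]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the true partner.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings: matching pairs point in the same direction.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
visual_aligned = [[1.0, 0.0], [0.0, 1.0]]
visual_swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(audio_emb, visual_aligned)
loss_swapped = info_nce(audio_emb, visual_swapped)
```

In this toy setup the aligned pairing yields a smaller loss than the swapped one, which is the behavior a matching objective of this kind rewards: embeddings of corresponding audio and visual content are pulled together while non-corresponding ones are pushed apart.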
Citation
Y. Lv, Z. Liu, and X. Chang, “Consistency-Queried Transformer for Audio-Visual Segmentation,” IEEE Transactions on Image Processing, vol. 34, pp. 2616–2627, 2025, doi: 10.1109/TIP.2025.3563076.
Source
IEEE Transactions on Image Processing
Keywords
Audio-visual segmentation, multimodal segmentation, consistency, aligned matching
Publisher
IEEE
