CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation

Chen, Yuanhong
Wang, Chong
Liu, Yuyuan
Wang, Hu
Carneiro, Gustavo
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Audio-visual segmentation (AVS) is an emerging task that aims to accurately segment sounding objects based on audio-visual cues. The success of AVS learning systems depends on the effectiveness of cross-modal interaction. Such a requirement can be naturally fulfilled by leveraging a transformer-based segmentation architecture, due to its inherent ability to capture long-range dependencies and its flexibility in handling different modalities. However, the inherent training issues of transformer-based methods, such as the low efficacy of cross-attention and unstable bipartite matching, can be amplified in AVS, particularly when the learned audio query does not provide a clear semantic clue. In this paper, we address these two issues with the new Class-conditional Prompting Machine (CPM). CPM improves the bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries. The efficacy of cross-modal attention is upgraded with new learning objectives for the audio, visual and joint modalities. We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy. This project is supported by the Australian Research Council (ARC) through grant FT190100525.
Citation
Y. Chen, C. Wang, Y. Liu, H. Wang, and G. Carneiro, “CPM: Class-Conditional Prompting Machine for Audio-Visual Segmentation,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 15068 LNCS, pp. 438–456, 2025, doi: 10.1007/978-3-031-72684-2_25.
Source
Computer Vision – ECCV 2024
Keywords
Audio-visual Learning, Multi-modal Learning, Segmentation
Publisher
Springer Nature