Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
Zhou, Jinxing ; Zhou, Yanghao ; Han, Mingfei ; Wang, Tong ; Chang, Xiaojun ; Cholakkal, Hisham ; Anwer, Rao Muhammad
Department
Computer Vision
Type
Conference proceeding
Abstract
Referring Audio-Visual Segmentation (Ref-AVS) aims to segment target objects in audible videos based on given reference expressions. Prior works typically rely on learning latent embeddings via multimodal fusion to prompt a tunable SAM/SAM2 decoder for segmentation, which requires strong pixel-level supervision and lacks interpretability. Taking the perspective of explicit reference understanding, we propose TGS-Agent, which decomposes the task into a Think-Ground-Segment process. This mimics human reasoning: the agent first identifies the referred object through multimodal analysis, then performs coarse-grained grounding, and finally produces a precise segmentation. To this end, we first propose Ref-Thinker, a multimodal language model capable of reasoning over textual, visual, and auditory cues. We construct an instruction-tuning dataset with explicit object-aware think-answer chains to fine-tune Ref-Thinker. The object description inferred by Ref-Thinker is used as an explicit prompt for Grounding-DINO and SAM2, which perform grounding and segmentation without relying on pixel-level supervision. Additionally, we introduce R2-AVSBench, a new benchmark with linguistically diverse and reasoning-intensive references for better evaluating model generalization. Our approach achieves state-of-the-art results on both the standard Ref-AVSBench and the proposed R2-AVSBench.
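The Think-Ground-Segment decomposition described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. Every name below (ThinkResult, SegmentResult, think_ground_segment, and the toy_* functions) is hypothetical, and the three stages are plain callables standing in for Ref-Thinker, Grounding-DINO, and SAM2 rather than calls into those libraries. The sketch only shows the control flow: an explicit text description is passed from the reasoning stage to the grounding stage, and boxes are passed from grounding to segmentation. Returning no masks when grounding finds no box is also an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical box convention for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class ThinkResult:
    reasoning: str    # the "think" part of a think-answer chain
    description: str  # the "answer": explicit description of the referred object


@dataclass
class SegmentResult:
    description: str
    boxes: List[Box]
    masks: List[object]  # one mask per box; concrete type depends on the segmenter


# Stage signatures (assumed, not taken from the paper).
Thinker = Callable[[Sequence[object], object, str], ThinkResult]
Grounder = Callable[[Sequence[object], str], List[Box]]
Segmenter = Callable[[Sequence[object], List[Box]], List[object]]


def think_ground_segment(
    frames: Sequence[object],
    audio: object,
    reference: str,
    thinker: Thinker,
    grounder: Grounder,
    segmenter: Segmenter,
) -> SegmentResult:
    """Run the three stages in order, passing explicit prompts between them."""
    # Think: reason over text, video, and audio to name the referred object.
    thought = thinker(frames, audio, reference)
    # Ground: use the object description as a text prompt to get coarse boxes.
    boxes = grounder(frames, thought.description)
    # Segment: refine boxes into masks; skip if nothing was grounded (assumption).
    masks = segmenter(frames, boxes) if boxes else []
    return SegmentResult(thought.description, boxes, masks)


# Toy stand-ins so the sketch runs without any model weights.
def toy_thinker(frames, audio, reference):
    return ThinkResult(
        reasoning="The reference mentions a sounding instrument; a guitar is heard.",
        description="the guitar being played",
    )


def toy_grounder(frames, description):
    return [(10.0, 20.0, 110.0, 220.0)]


def toy_segmenter(frames, boxes):
    return [f"mask-for-{box}" for box in boxes]


if __name__ == "__main__":
    result = think_ground_segment(
        frames=["frame0", "frame1"],
        audio="waveform",
        reference="the object making the sound",
        thinker=toy_thinker,
        grounder=toy_grounder,
        segmenter=toy_segmenter,
    )
    print(result.description, len(result.boxes), len(result.masks))
```

Keeping the stages as interchangeable callables reflects the interpretability point in the abstract: the intermediate object description is ordinary text that can be inspected before any box or mask is produced.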
Citation
J. Zhou, Y. Zhou, M. Han, T. Wang, X. Chang, H. Cholakkal, et al., "Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation," 2026, pp. 13665-13673.
Source
Proceedings of the AAAI Conference on Artificial Intelligence
Conference
AAAI Conference on Artificial Intelligence
Keywords
46 Information and Computing Sciences, 4602 Artificial Intelligence, 4608 Human-Centred Computing
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
