
Towards Open-Vocabulary Audio-Visual Event Localization

Zhou, Jinxing
Guo, Dan
Guo, Ruohao
Mao, Yuxin
Hu, Jingjing
Zhong, Yiran
Chang, Xiaojun
Wang, Meng
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
The Audio-Visual Event Localization (AVEL) task aims to temporally locate and classify video events that are both audible and visible. Most research in this field assumes a closed-set setting, which restricts these models' ability to handle test data containing event categories absent (unseen) during training. Recently, a few studies have explored AVEL in an open-set setting, enabling the recognition of unseen events as "unknown", but without providing category-specific semantics. In this paper, we advance the field by introducing the Open-Vocabulary Audio-Visual Event Localization (OV-AVEL) problem, which requires localizing audio-visual events and predicting explicit categories for both seen and unseen test data at inference. To address this new task, we propose the OV-AVEBench dataset, comprising 24,800 videos across 67 real-life audio-visual scenes (seen:unseen = 46:21), each with manual segment-level annotation. We also establish three evaluation metrics for this task. Moreover, we investigate two baseline approaches, one training-free and one based on a fine-tuning paradigm. Specifically, we utilize the unified multimodal space of the pretrained ImageBind model to extract audio, visual, and textual (event-class) features. The training-free baseline then determines predictions by comparing the consistency of audio-text and visual-text feature similarities. The fine-tuning baseline incorporates lightweight temporal layers to encode temporal relations within the audio and visual modalities, using the OV-AVEBench training data for model fine-tuning. We evaluate these baselines on the proposed OV-AVEBench dataset and discuss potential directions for future work in this new field.
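
For readers who want a concrete picture of the training-free baseline described in the abstract, the snippet below is a minimal sketch of one plausible reading of the consistency check, not the authors' implementation: segment-level audio and visual embeddings (e.g., pre-extracted with ImageBind) are compared against class-name text embeddings, and a segment is assigned an event class only when both modalities agree on the same class. The function name, tensor shapes, and the exact agreement rule are illustrative assumptions.

import torch
import torch.nn.functional as F

def training_free_ov_avel(audio_feats, visual_feats, text_feats, background_idx=-1):
    # audio_feats:  (T, D) per-segment audio embeddings (assumed pre-extracted)
    # visual_feats: (T, D) per-segment visual embeddings
    # text_feats:   (C, D) embeddings of the candidate event-class names
    # Returns a (T,) tensor of predicted class indices; background_idx marks
    # segments where the audio and visual predictions disagree.
    a_sim = F.normalize(audio_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T   # (T, C) audio-text similarity
    v_sim = F.normalize(visual_feats, dim=-1) @ F.normalize(text_feats, dim=-1).T  # (T, C) visual-text similarity
    a_pred = a_sim.argmax(dim=-1)  # class favored by the audio stream
    v_pred = v_sim.argmax(dim=-1)  # class favored by the visual stream
    # Declare an audio-visual event only when both modalities agree on the class.
    return torch.where(a_pred == v_pred, a_pred, torch.full_like(a_pred, background_idx))

# Toy usage with random features: 10 one-second segments, 512-d embeddings, 5 candidate classes.
if __name__ == "__main__":
    preds = training_free_ov_avel(torch.randn(10, 512), torch.randn(10, 512), torch.randn(5, 512))
    print(preds)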
Citation
J. Zhou et al., "Towards Open-Vocabulary Audio-Visual Event Localization," 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 2025, pp. 8362-8371, doi: 10.1109/CVPR52734.2025.00783.
Source
Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Conference
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025
Publisher
IEEE