Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration
Zhou, Ziheng; Zhou, Jinxing; Qian, Wei; Tang, Shengeng; Chang, Xiaojun; Guo, Dan
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper addresses the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. The task seeks to identify and temporally pinpoint all events occurring simultaneously in both the audio and visual streams. Typically, each video contains dense events of multiple classes, which may overlap on the timeline and exhibit varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To this end, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) module and the Multi-Temporal Granularity Collaboration (MTGC) module. Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former aggregates consistent event semantics across modalities by encoding audio-visual relations, while the latter guides one modality's focus to pivotal event-relevant temporal regions as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support between coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.
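The core CMCC idea sketched in the abstract (one modality attends to the other, then is re-weighted by a temporal gate derived from that other modality) can be illustrated roughly as follows. This is a toy NumPy sketch under loose assumptions, not the authors' implementation: the function names, the single-head attention, and the fixed projection `w` standing in for a learned gate are all assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query, key, value):
    # Scaled dot-product attention: queries come from one modality,
    # keys/values from the other (each a T x D feature sequence).
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

def consistency_gate(source, w):
    # Sigmoid gate over time derived from the source modality,
    # intended to highlight event-relevant temporal regions.
    return 1.0 / (1.0 + np.exp(-(source @ w)))  # shape (T, 1)

rng = np.random.default_rng(0)
T, D = 8, 16                       # 8 time steps, 16-dim features
audio = rng.standard_normal((T, D))
visual = rng.standard_normal((T, D))
w = rng.standard_normal((D, 1))    # stand-in for a learned projection

# Visual features attend to audio (cross-modal interaction branch) ...
v_enhanced = cross_modal_attention(visual, audio, audio)
# ... and are then re-weighted by an audio-derived temporal gate
# (temporal consistency-gated branch).
v_gated = v_enhanced * consistency_gate(audio, w)
```

The symmetric direction (audio gated by visual cues) would follow the same pattern with the modalities swapped.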
Citation
Z. Zhou, J. Zhou, W. Qian, S. Tang, X. Chang, and D. Guo, “Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 10, pp. 10905–10913, Apr. 2025, doi: 10.1609/AAAI.V39I10.33185.
Source
Proceedings of the AAAI Conference on Artificial Intelligence
Conference
39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Keywords
Audio-visual, Cross-modal, Cross-modal interaction, Event localizations, Multi-temporal, Multiple class, Temporal features, Temporal granularity, Visual learning, Visual scene understanding, Modal analysis
Publisher
Association for the Advancement of Artificial Intelligence
