
Video Corpus Moment Retrieval with Query-specific Context Learning and Progressive Localization

Authors
Zhang, Long
Song, Peipei
Duan, Zhangling
Wang, Shuo
Chang, Xiaojun
Yang, Xun
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Video corpus moment retrieval (VCMR) aims to retrieve a moment from a large corpus of untrimmed videos corresponding to a given language query. However, existing methods often fall short due to their reliance on simple cross-modal attention mechanisms and one-stop localization, which fail to handle the complex multimodal information and large search space effectively. To address these challenges, we propose a novel VCMR method with Query-specific Context Learning and Progressive Localization (QCLPL). First, we construct query-specific multimodal contexts that capture complementary and consistent semantics across subtitles and frames, ensuring informative and efficient context building. We further introduce a semantic contrastive loss to refine these multimodal contexts, filtering out query-irrelevant information. Additionally, we introduce a progressive localization strategy that transforms the moment localization task into a two-stage process. By classifying frames into foreground and background regions, we present a simplified binary classification problem before boundary prediction, constrained by a region-aware loss. This progressive approach leverages region priors to improve subsequent moment localization. Extensive experiments on the TVR and DiDeMo datasets demonstrate that our method significantly outperforms existing approaches, setting a new state of the art for VCMR.
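The progressive localization idea in the abstract (first separate foreground from background frames, then predict boundaries using that region prior) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed 0.5 threshold, the additive start/end scoring, and the fallback to all frames are assumptions made only for illustration, and the per-frame scores are taken as given rather than produced by a trained model.

```python
from typing import List, Tuple


def foreground_mask(fg_probs: List[float], threshold: float = 0.5) -> List[bool]:
    """Stage 1 (assumed form): binary foreground/background decision per frame."""
    return [p >= threshold for p in fg_probs]


def localize(
    fg_probs: List[float],
    start_scores: List[float],
    end_scores: List[float],
    threshold: float = 0.5,
) -> Tuple[int, int]:
    """Stage 2 (assumed form): pick the best (start, end) frame pair,
    searching only among frames that stage 1 marked as foreground."""
    n = len(fg_probs)
    if n == 0 or not (n == len(start_scores) == len(end_scores)):
        raise ValueError("score lists must be non-empty and of equal length")

    mask = foreground_mask(fg_probs, threshold)
    candidates = [i for i, is_fg in enumerate(mask) if is_fg]
    if not candidates:
        # Hypothetical fallback: with no foreground frames, search every frame.
        candidates = list(range(n))

    best_pair = (candidates[0], candidates[0])
    best_score = float("-inf")
    for s in candidates:
        for e in candidates:
            if e < s:
                continue  # a moment cannot end before it starts
            score = start_scores[s] + end_scores[e]
            if score > best_score:
                best_score = score
                best_pair = (s, e)
    return best_pair


if __name__ == "__main__":
    fg = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1]
    starts = [0.9, 0.1, 0.6, 0.2, 0.1, 0.0]
    ends = [0.0, 0.1, 0.2, 0.3, 0.8, 0.9]
    print(localize(fg, starts, ends))
```

In this toy input the highest raw start and end scores fall on background frames 0 and 5, so an unconstrained search would return the whole clip. Restricting the search to the foreground region narrows it to frames 2 through 4, which is the kind of reduced search space the abstract attributes to region priors.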
Citation
L. Zhang, P. Song, Z. Duan, S. Wang, X. Chang and X. Yang, "Video Corpus Moment Retrieval with Query-specific Context Learning and Progressive Localization," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2025.3530570.
Source
IEEE Transactions on Circuits and Systems for Video Technology
Keywords
Location awareness, Semantics, Proposals, Contrastive learning, Visualization, Quantum cascade lasers, Electronic mail, Streams, Streaming media
Publisher
IEEE