Multi-modal integrated proposal generation network for weakly supervised video moment retrieval
Fang, Dikai ; Xu, Huahu ; Wei, Wei ; Guizani, Mohsen ; Gao, Honghao
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
Video moment retrieval aims to precisely identify and localize a specific segment within an untrimmed video that corresponds to a given natural language query. The manual annotation of temporal boundaries is recognized as both time-consuming and subjective, motivating the exploration of weakly supervised methods. However, prevailing strategies in this field rely primarily on the combination of visual and textual features, overlooking the abundant contextual information provided by scenes, objects, and motions inherent in videos. Additionally, existing approaches often lack effective mechanisms for detecting and utilizing negative proposals. To address these limitations, this paper introduces a Multi-Modal Integrated Proposal Generation Network (MIPGN), a novel framework designed to enhance video moment retrieval. First, the MIPGN uses frame feature clustering to analyze and understand scene distributions in videos, guiding the generation of adaptive proposals that reflect the complexity of the scene. Second, by leveraging existing pretrained models, object and motion tags are extracted from proposals to enrich the multi-modal feature representation. Third, the incorporation of query-tag similarity loss, along with query reconstruction loss, significantly strengthens the model’s discriminative ability within the contrastive learning paradigm. Finally, our proposed method demonstrates superior performance in comprehensive experiments on the Charades-STA and ActivityNet Captions datasets, and detailed ablation studies further emphasize the substantial impact of each component on the overall effectiveness of the framework.
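The first step described in the abstract, clustering frame features to estimate scene distribution and then deriving proposals from it, can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm or proposal rule MIPGN uses, so everything here is an assumption: plain k-means stands in for the clustering step, and the function names `kmeans_labels`, `scene_segments`, and `adaptive_proposals` as well as the `max_span` parameter are hypothetical.

```python
import math

def kmeans_labels(features, k, iters=20):
    """Plain k-means over per-frame feature vectors (assumed clustering method)."""
    n = len(features)
    # Deterministic initialisation: k evenly spaced frames act as initial centres.
    centers = [list(features[(i * (n - 1)) // max(k - 1, 1)]) for i in range(k)]
    labels = [0] * n
    for _ in range(iters):
        # Assign every frame to its nearest centre (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(f, centers[c]))
                  for f in features]
        # Move each centre to the mean of the frames assigned to it.
        for c in range(k):
            members = [f for f, lab in zip(features, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def scene_segments(labels):
    """Merge runs of consecutive frames sharing a label into (start, end) spans, end exclusive."""
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t))
            start = t
    return segments

def adaptive_proposals(segments, max_span=2):
    """Candidate moments covering 1..max_span consecutive scene segments."""
    proposals = []
    for i in range(len(segments)):
        for j in range(i, min(i + max_span, len(segments))):
            proposals.append((segments[i][0], segments[j][1]))
    return proposals

# Toy video: 12 frames with 2-D features forming three visually distinct scenes.
frames = [[0.0, 0.0]] * 4 + [[5.0, 5.0]] * 3 + [[10.0, 0.0]] * 5
segments = scene_segments(kmeans_labels(frames, k=3))
proposals = adaptive_proposals(segments)
```

In this sketch a video with more scene changes yields more segments and therefore more candidate proposals, which is one simple way a proposal set could adapt to scene complexity; the published method may differ.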
Citation
D. Fang, H. Xu, W. Wei, M. Guizani, and H. Gao, “Multi-modal integrated proposal generation network for weakly supervised video moment retrieval,” Expert Syst Appl, vol. 269, p. 126497, Apr. 2025, doi: 10.1016/J.ESWA.2025.126497.
Source
Expert Systems with Applications
Keywords
Video moment retrieval, Weakly supervised learning, Multi-modal information, Contrastive learning
Publisher
Elsevier
