
SELongVLM: Empowering Long Video Language Models with Self-Corrective Clip Selection

Zhang, Kecheng
Yang, Zongxin
Han, Mingfei
Zhuge, Yunzhi
Hao, Haihong
Li, Changlin
Li, Zhihui
Chang, Xiaojun
Department
Computer Vision
Type
Journal article
Language
English
Abstract
Recent advances in multimodal large language models (MLLMs) have enabled impressive progress in visual-language reasoning, yet long-video understanding remains a formidable challenge due to the need for coherent reasoning over ultra-long spatiotemporal dependencies. Existing methods struggle with the vast candidate space for relevant information in long videos, often failing to distinguish meaningful events from redundant content. We identify two critical and previously under-explored issues: absolute redundancy, where static visual content inflates token counts without adding narrative value, and relative redundancy, where task-irrelevant segments introduce noise that impairs reasoning. Compounding these issues is the weak spatiotemporal modeling in current MLLMs, which limits their ability to capture complex event dynamics. To address these multifaceted challenges, we introduce SELongVLM, a long-video language model with dynamically lenient-to-stringent clip selection. SELongVLM integrates two coordinated branches: a Residual Token Pruner (RTP) that removes repetitive background tokens via inter-frame residual modeling, mitigating absolute redundancy while preserving motion cues, and a Semantic-aware Self-Correction Selector (SCSelector) that progressively refines query-relevant clip selection without frame-level annotations to reduce relative redundancy, guided by a stringent-to-lenient self-correcting mechanism during optimization. To ensure causal continuity and bolster spatiotemporal reasoning across disjoint clips, the framework further incorporates an action-aware operation for intra-clip dynamics and a temporal memory for cross-clip context, enabling robust spatiotemporal inference on long videos. Extensive experiments across eight benchmarks demonstrate that SELongVLM markedly outperforms existing models on both general and specialized long-video tasks. Specifically, it achieves 65.5% on VideoMME and 69.8% on MLVU for general benchmarks, and delivers strong performance on four specialized benchmarks, for example 39.2% on TOMATO for fine-grained temporal reasoning and 69.2% on EventBench for event-level understanding.
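The abstract describes the Residual Token Pruner only at a high level (dropping repetitive background tokens via inter-frame residuals while keeping motion cues). The following is a minimal illustrative sketch of that general idea, not the paper's implementation; the function name, the keep_ratio parameter, and the top-k selection rule are assumptions introduced here for illustration.

```python
# Illustrative sketch of inter-frame residual token pruning (hypothetical;
# SELongVLM's actual RTP is described only at a high level in the abstract).
import torch


def prune_static_tokens(frame_tokens: torch.Tensor, keep_ratio: float = 0.5) -> list:
    """frame_tokens: (T, N, D) visual tokens for T frames with N tokens each.

    Keeps the first frame in full, then for every later frame keeps only the
    tokens whose residual against the previous frame is largest, so static
    background tokens are dropped while motion-related tokens survive.
    """
    T, N, _ = frame_tokens.shape
    kept = [frame_tokens[0]]                       # anchor frame kept in full
    k = max(1, int(N * keep_ratio))                # tokens retained per later frame
    for t in range(1, T):
        residual = (frame_tokens[t] - frame_tokens[t - 1]).norm(dim=-1)  # (N,)
        idx = residual.topk(k).indices             # most-changed tokens
        kept.append(frame_tokens[t, idx])
    return kept


if __name__ == "__main__":
    tokens = torch.randn(8, 196, 768)              # e.g. 8 frames of 14x14 patch tokens
    pruned = prune_static_tokens(tokens, keep_ratio=0.25)
    print([p.shape[0] for p in pruned])            # [196, 49, 49, ...]
```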
Citation
K. Zhang, Z. Yang, M. Han, Y. Zhuge, H. Hao, C. Li, et al., "SELongVLM: Empowering Long Video Language Models with Self-Corrective Clip Selection," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP, no. 99, pp. 1-16, 2026, https://doi.org/10.1109/tpami.2026.3673141.
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
46 Information and Computing Sciences, 4603 Computer Vision and Multimedia Computation
Publisher
IEEE