On Utilizing Auxiliary Information for Weakly-Supervised Video Anomaly Detection
Almarri, Salem Saqer Majid
Files
Author
Supervisor
Department
Computer Vision
Embargo End Date
2025-05-30
Type
Dissertation
Date
2025
License
Language
English
Abstract
Weakly-Supervised Video Anomaly Detection (WS-VAD) is a challenging computer vision task with various security and public safety applications. The end goal in WS-VAD is to learn an anomaly scoring function that quantifies the degree of abnormality in each frame of a given video. However, this task is difficult because fine-grained frame-level, or even segment-level, annotations are unavailable. Given only a coarse binary label at the video level, it is typically assumed that a video labeled as anomalous contains at least one abnormal segment, while a video labeled as normal contains no abnormal segment. This results in a phenomenon known as segment ambiguity: it is unclear which segment(s) of an anomalous video are truly abnormal. In this research, we explore how auxiliary information can be exploited to mitigate the limitations of WS-VAD and address segment ambiguity to produce a robustly trained anomaly scoring model. Three forms of auxiliary information have been explored as part of this thesis:
1. Pseudo Event Boundaries: It is critical for any video anomaly detection model to learn the boundaries between normal and anomalous events. Since this information is not available during training in WS-VAD, we propose a self-supervised shuffling mechanism based on a two-state Markov process that alternates between segments of an abnormal and a normal video to generate a “virtual” video sequence. The transitions between segments drawn from the normal and anomalous videos in the virtual video sequence can be considered pseudo event boundaries. Since these pseudo event boundaries are known a priori, this auxiliary information can be leveraged by encouraging the anomaly detector to learn temporal event boundaries and center points of events as auxiliary tasks in addition to the primary anomaly scoring task. We demonstrate that this multi-task learning approach partially addresses segment ambiguity and enhances robustness to noisy labeling.
2. Textual Event Descriptions: Although it is often difficult to obtain fine-grained frame-level annotations for a video, it is relatively simple to obtain high-level textual event descriptions for videos (e.g., shoplifting, explosion, fire). Since the typical goal is to learn a video anomaly detector that is agnostic to specific event types, these textual event descriptions are often ignored in WS-VAD. In this work, we utilize textual event descriptions as auxiliary information and formulate the alignment between visual and textual feature representations within a shared embedding space as an optimal transport problem. The video anomaly detector is built on top of a vision-language model (VLM) and is guided not only by coarse binary labels but also by textual prompts that represent a dictionary of abnormal and normal events, allowing the VLM to better recognize anomalies.
3. Audio Modality: In some VAD applications, the audio modality is also available in addition to the visual information. Although several multimodal anomaly detectors have been proposed to exploit the auxiliary audio information, they are usually trained on the assumption that both the visual and audio modalities are always available. However, in real-world use cases, one or both modalities could be corrupted. We first introduce a modality corruption benchmark to evaluate WS-VAD performance under missing or corrupted modalities. We also learn a novel shared representation space for multimodal feature embeddings that is robust to modality corruption. This allows the VAD model to provide an effective prediction even when one modality is missing or compromised. During inference, we introduce a dynamic weighting scheme that leverages a Gaussian Mixture Model (trained on clean data) to estimate the likelihood of corruption in each modality and adaptively assign importance to the individual modalities, ensuring robustness under degraded inputs.
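As an illustration of the shuffling idea in item 1, the following is a minimal sketch and not the thesis implementation. It assumes each video is already split into an ordered list of segments; the function name `markov_shuffle`, the switching probability `p_switch`, and the rule that the chain is forced to the other video once one is exhausted are all hypothetical choices made for this example.

```python
import random


def markov_shuffle(normal_segments, abnormal_segments, p_switch=0.3, seed=None):
    """Interleave two videos' segments using a two-state Markov chain.

    State 0 draws the next segment from the normal video, state 1 from the
    abnormal video. Segment order within each video is preserved.
    Returns (virtual, labels, boundaries), where labels[i] is the source
    video of virtual[i] and boundaries lists the indices at which the
    source changes (the pseudo event boundaries).
    """
    rng = random.Random(seed)
    sources = (list(normal_segments), list(abnormal_segments))
    cursors = [0, 0]
    state = rng.randint(0, 1)  # random initial state
    virtual, labels = [], []

    while cursors[0] < len(sources[0]) or cursors[1] < len(sources[1]):
        exhausted = cursors[state] >= len(sources[state])
        # Switch with probability p_switch, or always when the current
        # source has no segments left (assumption made for this sketch).
        if exhausted or rng.random() < p_switch:
            other = 1 - state
            if cursors[other] < len(sources[other]):
                state = other
        virtual.append(sources[state][cursors[state]])
        labels.append(state)
        cursors[state] += 1

    # A boundary is any position whose source differs from the previous one.
    boundaries = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return virtual, labels, boundaries


if __name__ == "__main__":
    normal = [f"n{i}" for i in range(6)]
    abnormal = [f"a{i}" for i in range(6)]
    virtual, labels, boundaries = markov_shuffle(normal, abnormal, seed=0)
    print(virtual)
    print(boundaries)
```

Note that `labels` records only which video a segment came from, not whether the segment is truly abnormal. Only the transition positions are known with certainty, which is why they can serve as targets for an auxiliary boundary-prediction task.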
Extensive experiments on well-known WS-VAD benchmark datasets (e.g., UCF-Crime and XD-Violence) demonstrate that our novel contributions exploiting the above forms of auxiliary information significantly improve the accuracy of WS-VAD models, paving the way toward practical, real-world adoption of WS-VAD systems.
Citation
Salem Saqer Majid Almarri, “On Utilizing Auxiliary Information for Weakly-Supervised Video Anomaly Detection,” Doctor of Philosophy thesis, Computer Vision, MBZUAI, 2025.
Keywords
Weakly-Supervised Video Anomaly Detection, Segment-Level Ambiguity, Optimal Transport Alignment, Vision-Language Models, Modality Corruption Robustness
