
Spiking Video Understanding: Energy-Efficient Action Recognition using SNNs and Spike Camera with Complementary RGB and Thermal Modalities

Attia, Yasser Ashraf Saleh
Department
Machine Learning
Embargo End Date
2025-05-30
Type
Thesis
Date
2025
Language
English
Abstract
Video Action Recognition (VAR) plays a crucial role in computer vision, with applications spanning surveillance, healthcare, sports analytics, and robotics. While deep learning models such as convolutional neural networks (CNNs) and transformer-based architectures have achieved impressive accuracy using RGB and thermal data, they come with a major drawback: high energy consumption. These models demand significant computational resources, making them impractical for real-time, edge, and low-power applications. Inspired by the principles of neuromorphic computing, this research explores Spiking Neural Networks (SNNs) as an energy-efficient alternative for video understanding tasks. Despite their potential, SNNs face notable challenges, particularly when processing temporally structured video data. A major hurdle in this field has been the lack of native spiking datasets compatible with SNNs and neuromorphic devices. To bridge this gap, this thesis introduces SPACT18, to the best of my knowledge the first multimodal VAR dataset captured using a spike camera, synchronized with RGB and thermal imaging. SPACT18 consists of 18 daily activities recorded from 44 diverse participants, providing a rich and realistic benchmark for evaluating SNN-based models. The ultra-high temporal resolution of spike cameras (up to 20,000 Hz) captures intricate motion details, offering a level of precision that surpasses traditional frame-based recording methods.

To further enhance efficiency, this research proposes a novel spiking data compression algorithm that reduces temporal redundancy while retaining essential event-driven information. This approach significantly cuts computational demands, making real-time VAR with SNN architectures more feasible (a sketch of the binning idea appears after the findings below).

Extensive experiments compare multiple ANN models, including X3D, SlowFast, I3D, and UniFormer, across three data modalities: spiking, RGB, and thermal. Additionally, both direct SNN training and ANN-SNN conversion techniques are explored to assess their effectiveness in video recognition tasks. Key experimental findings include:

1. Thermal imaging outperforms RGB in low-light conditions, making it highly effective for real-world applications.
2. Spiking data delivers competitive performance while remaining compatible with neuromorphic devices and SNNs, paving the way toward fully energy-efficient video understanding.
3. Extreme compression reduces inference time but can cause information loss, negatively impacting recognition accuracy.
4. Direct SNN training lags behind ANNs in video understanding by approximately 30%, despite achieving comparable performance in image classification. This gap is attributed to SNN optimization challenges for video models and the absence of large-scale SNN video datasets.
5. ANN-SNN conversion retains high accuracy but suffers from increased latency, highlighting a trade-off between computational efficiency and inference speed (the rate-coding sketch after the abstract illustrates why). The latency stems from the depth of 3D CNNs, the absence of ReLU-based activations in most video models, and the inapplicability of existing ANN-to-SNN conversion methods to video tasks. Despite its success in image classification, ANN-SNN conversion remains inferior to ANN models in video classification.
6. Hybrid ANN-SNN architectures show promising results, harnessing the strengths of both deep learning and neuromorphic computing.

By integrating event-driven SNN models with traditional frame-based vision systems, this research lays the foundation for neuromorphic computing in video analysis.
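The abstract does not specify the compression scheme beyond "reducing temporal redundancy"; the following is a minimal illustrative sketch of one such scheme, temporal binning with logical-OR aggregation, assuming the spike stream is stored as a binary array of shape (T, H, W). The function name, bin size, and aggregation rule are assumptions for illustration, not the thesis's exact algorithm.

    import numpy as np

    def compress_spike_stream(spikes: np.ndarray, bin_size: int = 8) -> np.ndarray:
        # Aggregate consecutive time steps: a pixel is "on" in an output
        # bin if it fired at least once inside that bin (logical OR).
        t, h, w = spikes.shape
        t_trim = (t // bin_size) * bin_size  # drop the ragged tail, if any
        binned = spikes[:t_trim].reshape(-1, bin_size, h, w)
        return binned.any(axis=1).astype(spikes.dtype)

    # Example: one second of a 20,000 Hz stream binned down to 2,500 steps.
    stream = (np.random.rand(20000, 64, 64) < 0.05).astype(np.uint8)
    print(compress_spike_stream(stream, bin_size=8).shape)  # (2500, 64, 64)

Larger bins shrink the temporal dimension (and thus inference cost) further, but merge distinct events into one, consistent with finding 3 above.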
The introduction of SPACT18, alongside the proposed compression algorithm and benchmarking of SNN-based models, paves the way for real-time, low-power, and biologically inspired video recognition systems. Future research will focus on improving SNN-specific training methodologies, exploring multimodal fusion techniques, and refining hybrid models to enhance the accuracy, efficiency, and real-world deployment of neuromorphic vision systems.
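The latency penalty in finding 5 follows from how rate-based ANN-SNN conversion works: a spiking neuron approximates a ReLU activation through its average firing rate, and that rate only converges as the number of simulated time steps grows. A self-contained toy sketch (the threshold, step count, and function name are illustrative assumptions):

    def if_rate(a: float, theta: float = 1.0, steps: int = 256) -> float:
        # Integrate-and-fire neuron driven by constant input `a`: charge
        # accumulates each step; a spike fires when the membrane potential
        # reaches the threshold, which is then subtracted (soft reset).
        v, spikes = 0.0, 0
        for _ in range(steps):
            v += a
            if v >= theta:
                spikes += 1
                v -= theta  # subtraction reset keeps the residual charge
        return spikes / steps  # average firing rate over the window

    for a in (0.5, 0.25, 0.0, -0.3):
        print(a, if_rate(a))  # rate approaches max(0, a) / theta, a scaled ReLU

Fewer steps quantize the rate more coarsely, so accurate conversion demands long simulation windows; deep 3D CNNs compound this effect, mirroring the accuracy-versus-latency trade-off reported above.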
Citation
Yasser Ashraf Saleh Attia, “Spiking Video Understanding: Energy-Efficient Action Recognition using SNNs and Spike Camera with Complementary RGB and Thermal Modalities,” Master of Science thesis, Machine Learning, MBZUAI, 2025.
Keywords
Video Understanding, Action recognition, Spiking Neural Network, Spike Camera, Multimodal, ANN-SNN Conversion