Rethinking Efficient Deep Learning Architectures for Visual Recognition
Abdelrahman Mohamed Shaker Youssief
Department
Computer Vision
Embargo End Date
30/05/2025
Type
Dissertation
Date
2025
Language
English
Abstract
Deep learning has achieved remarkable success in numerous visual recognition tasks across multiple data modalities. Vision transformers (ViTs) and state-space models (SSMs) have recently emerged as powerful alternatives to traditional convolutional neural networks (CNNs), demonstrating strong performance in various visual recognition tasks. Despite these advances, deploying them in resource-constrained environments (e.g., mobile phones, wearables, and edge devices) remains a significant challenge due to high computational complexity, memory overhead, and power constraints. This thesis addresses these limitations by proposing novel computational blocks and efficient yet accurate deep-learning architectures for visual recognition across multiple modalities. For image-based tasks (classification, detection, and segmentation), we first introduce a hybrid CNN-Transformer architecture that integrates convolution and self-attention using a novel Split Depth-Wise Transpose Attention (SDTA) encoder. It reduces computational complexity while maintaining high recognition accuracy, enabling real-time inference on the NVIDIA Jetson Nano. Second, a lightweight mobile vision backbone is introduced that leverages an efficient additive attention mechanism. Instead of traditional multi-headed self-attention, it employs linear element-wise multiplications to achieve superior latency on mobile devices (e.g., iPhone) while remaining competitive with state-of-the-art models. Third, a parameter-efficient state-space architecture is introduced with a Modulated Group Mamba layer, which enhances efficiency through multi-directional scanning across channel groups, providing comprehensive spatial coverage and effective local-global modeling.
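The additive attention idea above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the thesis implementation: tokens are scored by a single learnable vector (`wa`, hypothetical name), pooled into one global query, and combined with the keys by element-wise multiplication, so the cost stays linear in the number of tokens instead of quadratic as in standard self-attention.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(x, Wq, Wk, wa):
    """Sketch of linear additive attention (simplified, not the thesis code).

    x: (n, d) tokens; Wq, Wk: (d, d) projections; wa: (d,) learnable vector.
    """
    q, k = x @ Wq, x @ Wk
    # Per-token scalar score: O(n*d) work, avoiding the O(n^2 * d) QK^T matrix.
    alpha = softmax((q @ wa) / np.sqrt(q.shape[1]))  # (n,)
    g = (alpha[:, None] * q).sum(axis=0)             # pooled global query, (d,)
    return g[None, :] * k                            # element-wise interaction, (n, d)

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
out = additive_attention(x, rng.standard_normal((d, d)),
                         rng.standard_normal((d, d)), rng.standard_normal(d))
print(out.shape)  # (8, 16)
```

Because only a single pooled query interacts with all keys, doubling the token count doubles the work, which is the property that makes this style of attention attractive on mobile hardware.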
For video recognition tasks, an efficient video object segmentation (VOS) approach is introduced that incorporates a dynamic, long-term Modulated Cross-Attention (MCA) memory mechanism, substantially improving VOS efficiency while preserving the accuracy characteristic of ViT-based designs. Furthermore, an accurate, lightweight multimodal video architecture is proposed that integrates efficient visual encoders, a small language model, and novel token-projection mechanisms for video understanding. The final contribution of this thesis is an efficient and accurate transformer-based architecture for volumetric segmentation. In contrast to standard self-attention, the proposed hierarchical architecture leverages Efficient Paired Attention (EPA) blocks to effectively model spatial and channel-wise features with linear computational complexity. The approach is computationally efficient and achieves promising performance across multiple medical segmentation tasks: multi-organ, brain tumor, lung cancer, and cardiac segmentation.
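How paired spatial and channel attention can stay linear in the token count can be sketched as follows. This is a hedged toy illustration under stated assumptions, not the EPA block itself: the spatial branch projects keys and values down to a small fixed number of tokens `p` (here a random matrix stands in for a learned projection), so the attention map is (n, p) rather than (n, n); the channel branch attends over a (d, d) channel map, whose size is independent of n.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def paired_attention(x, proj):
    """Toy paired spatial/channel attention (illustrative, not the thesis EPA).

    x: (n, d) flattened volume tokens; proj: (n, p) projection to p << n tokens
    (a stand-in for a learned projection).
    """
    n, d = x.shape
    # Spatial branch: keys/values reduced from n tokens to p, so the
    # attention map is (n, p) and the cost is linear in n.
    k_p = proj.T @ x                                    # (p, d)
    spatial = softmax(x @ k_p.T / np.sqrt(d)) @ k_p     # (n, d)
    # Channel branch: attention over a (d, d) channel map, also linear in n.
    channel = x @ softmax(x.T @ x / np.sqrt(n))         # (n, d)
    return spatial + channel

rng = np.random.default_rng(1)
n, d, p = 64, 8, 4
x = rng.standard_normal((n, d))
out = paired_attention(x, rng.standard_normal((n, p)))
print(out.shape)  # (64, 8)
```

For volumetric data, where n grows with the cube of the resolution, avoiding the (n, n) attention map is what keeps memory and compute tractable.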
Citation
Abdelrahman Mohamed Shaker Youssief, “Rethinking Efficient Deep Learning Architectures for Visual Recognition,” Doctor of Philosophy thesis, Computer Vision, MBZUAI, 2025.
Keywords
Efficient Deep Learning Architectures, Efficient Image Recognition, Efficient Video Recognition, Efficient Volumetric (3D) Data Recognition, Edge Devices