Rethinking Efficient Deep Learning Architectures for Visual Recognition
Abdelrahman Mohamed Shaker Youssief
Department
Computer Vision
Embargo End Date
30/05/2025
Type
Dissertation
Date
2025
Language
English
Abstract
Deep learning has achieved remarkable success in numerous visual recognition tasks across multiple data modalities. Vision transformers (ViTs) and state-space models (SSMs) have recently emerged as powerful alternatives to traditional convolutional neural networks (CNNs), demonstrating strong performance in various visual recognition tasks. Despite these advances, deploying them in resource-constrained environments (e.g., mobile phones, wearables, and edge devices) remains a significant challenge due to high computational complexity, memory overhead, and power constraints. This thesis addresses these limitations by proposing novel computational blocks and efficient yet accurate deep-learning architectures for visual recognition across multiple modalities. For image-based tasks (classification, detection, and segmentation), we first introduce a hybrid CNN-Transformer architecture that integrates convolution and self-attention using a novel Split Depth-Wise Transpose Attention (SDTA) encoder. It reduces computational complexity while maintaining high recognition accuracy, enabling real-time inference on the NVIDIA Jetson Nano. Second, a lightweight mobile vision backbone is introduced that leverages an efficient additive attention mechanism. Instead of traditional multi-headed self-attention, it employs linear element-wise multiplications to achieve superior latency on mobile devices (e.g., iPhone) while remaining competitive with state-of-the-art models. Third, a parameter-efficient state-space architecture is introduced with a Modulated Group Mamba layer, which enhances efficiency through multi-directional scanning across channel groups, providing comprehensive spatial coverage and effective local-global modeling.
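The additive attention idea above can be illustrated with a minimal NumPy sketch. This is an assumption-laden simplification, not the thesis implementation: tokens are scored by a single learnable vector (`wa`, hypothetical name), pooled into one global query, and combined with the keys by element-wise multiplication, so the cost stays linear in the number of tokens instead of quadratic as in standard self-attention.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def additive_attention(x, Wq, Wk, wa):
    """Sketch of linear additive attention (simplified, not the thesis code).

    x: (n, d) tokens; Wq, Wk: (d, d) projections; wa: (d,) learnable vector.
    """
    q, k = x @ Wq, x @ Wk
    # Per-token scalar score: O(n*d) work, avoiding the O(n^2 * d) QK^T matrix.
    alpha = softmax((q @ wa) / np.sqrt(q.shape[1]))  # (n,)
    g = (alpha[:, None] * q).sum(axis=0)             # pooled global query, (d,)
    return g[None, :] * k                            # element-wise interaction, (n, d)

rng = np.random.default_rng(0)
n, d = 8, 16
x = rng.standard_normal((n, d))
out = additive_attention(x, rng.standard_normal((d, d)),
                         rng.standard_normal((d, d)), rng.standard_normal(d))
print(out.shape)  # (8, 16)
```

Because only a single pooled query interacts with all keys, doubling the token count doubles the work, which is the property that makes this style of attention attractive on mobile hardware.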
For video recognition tasks, an efficient video object segmentation (VOS) approach is introduced that incorporates a dynamic, long-term Modulated Cross-Attention (MCA) memory mechanism, substantially improving VOS efficiency while preserving the accuracy characteristic of ViT-based designs. Furthermore, an accurate, lightweight multimodal video architecture is proposed that integrates efficient visual encoders, a small language model, and novel token-projection mechanisms for video understanding. The final contribution of this thesis is an efficient and accurate transformer-based architecture for volumetric segmentation. In contrast to standard self-attention, the proposed hierarchical architecture leverages Efficient Paired Attention (EPA) blocks to effectively model spatial and channel-wise features with linear computational complexity. The approach is computationally efficient and achieves promising performance across multiple medical segmentation tasks: multi-organ, brain tumor, lung cancer, and cardiac segmentation.
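How paired spatial and channel attention can stay linear in the token count can be sketched as follows. This is a hedged toy illustration under stated assumptions, not the EPA block itself: the spatial branch projects keys and values down to a small fixed number of tokens `p` (here a random matrix stands in for a learned projection), so the attention map is (n, p) rather than (n, n); the channel branch attends over a (d, d) channel map, whose size is independent of n.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def paired_attention(x, proj):
    """Toy paired spatial/channel attention (illustrative, not the thesis EPA).

    x: (n, d) flattened volume tokens; proj: (n, p) projection to p << n tokens
    (a stand-in for a learned projection).
    """
    n, d = x.shape
    # Spatial branch: keys/values reduced from n tokens to p, so the
    # attention map is (n, p) and the cost is linear in n.
    k_p = proj.T @ x                                    # (p, d)
    spatial = softmax(x @ k_p.T / np.sqrt(d)) @ k_p     # (n, d)
    # Channel branch: attention over a (d, d) channel map, also linear in n.
    channel = x @ softmax(x.T @ x / np.sqrt(n))         # (n, d)
    return spatial + channel

rng = np.random.default_rng(1)
n, d, p = 64, 8, 4
x = rng.standard_normal((n, d))
out = paired_attention(x, rng.standard_normal((n, p)))
print(out.shape)  # (64, 8)
```

For volumetric data, where n grows with the cube of the resolution, avoiding the (n, n) attention map is what keeps memory and compute tractable.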
Citation
Abdelrahman Mohamed Shaker Youssief, “Rethinking Efficient Deep Learning Architectures for Visual Recognition,” Doctor of Philosophy thesis, Computer Vision, MBZUAI, 2025.
Keywords
Efficient Deep Learning Architectures, Efficient Image Recognition, Efficient Video Recognition, Efficient Volumetric (3D) Data Recognition, Edge Devices