ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning
Hou, Wenjin ; Fu, Dingjie ; Li, Kun ; Chen, Shiming ; Fan, Hehe ; Yang, Yi
Department
Computer Vision
Type
Conference proceeding
Date
2025
Language
English
Abstract
Zero-shot learning (ZSL) aims to recognize unseen classes by transferring semantic knowledge from seen classes to unseen ones, guided by semantic information. To this end, existing works have demonstrated remarkable performance by utilizing global visual features from Convolutional Neural Networks (CNNs) or Vision Transformers (ViTs) for visual-semantic interactions. Due to the limited receptive fields of CNNs and the quadratic complexity of ViTs, however, these visual backbones achieve suboptimal visual-semantic interactions. In this paper, motivated by the visual state space model (i.e., Vision Mamba), which is capable of capturing long-range dependencies and modeling complex visual dynamics, we propose a parameter-efficient ZSL framework called ZeroMamba to advance ZSL. Our ZeroMamba comprises three key components: Semantic-aware Local Projection (SLP), Global Representation Learning (GRL), and Semantic Fusion (SeF). Specifically, SLP integrates semantic embeddings to map visual features to local semantic-related representations, while GRL encourages the model to learn global semantic representations. SeF combines these two semantic representations to enhance the discriminability of semantic features. We incorporate these designs into Vision Mamba, forming an end-to-end ZSL framework. As a result, the learned semantic representations are better suited for classification. Through extensive experiments on four prominent ZSL benchmarks, ZeroMamba demonstrates superior performance, significantly outperforming the state-of-the-art (i.e., CNN-based and ViT-based) methods under both conventional ZSL (CZSL) and generalized ZSL (GZSL) settings.
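The abstract's three components can be illustrated with a minimal NumPy sketch: local patch features are projected into the semantic space (SLP), a pooled global feature is projected separately (GRL), and the two are fused (SeF) before compatibility scoring against class semantic embeddings. All dimensions, weight matrices, and the weighted-sum fusion below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not the paper's settings).
N, P, D, S, C = 4, 10, 16, 8, 5  # batch, patches, visual dim, semantic dim, classes

patch_feats = rng.standard_normal((N, P, D))   # per-patch visual features from the backbone
class_semantics = rng.standard_normal((C, S))  # class-level semantic embeddings (e.g. attributes)

# Hypothetical projection weights standing in for learned parameters.
W_slp = rng.standard_normal((D, S)) * 0.1  # Semantic-aware Local Projection
W_grl = rng.standard_normal((D, S)) * 0.1  # Global Representation Learning

# SLP: map each local (patch) feature into the semantic space, then pool.
local_sem = (patch_feats @ W_slp).mean(axis=1)   # (N, S)

# GRL: pool visual features globally first, then project into the semantic space.
global_sem = patch_feats.mean(axis=1) @ W_grl    # (N, S)

# SeF: fuse the two semantic representations (simple weighted sum as a stand-in).
alpha = 0.5
fused = alpha * local_sem + (1 - alpha) * global_sem  # (N, S)

# Classify by compatibility with the class semantic embeddings.
logits = fused @ class_semantics.T  # (N, C)
preds = logits.argmax(axis=1)       # predicted class index per sample
```

In the actual framework these projections are trained end-to-end inside Vision Mamba; the sketch only shows how the local and global semantic representations combine before classification.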
Citation
W. Hou, D. Fu, K. Li, S. Chen, H. Fan, and Y. Yang, “ZeroMamba: Exploring Visual State Space Model for Zero-Shot Learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 4, pp. 3527–3535, Apr. 2025, doi: 10.1609/AAAI.V39I4.32366.
Source
Proceedings of the AAAI Conference on Artificial Intelligence
Conference
39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Keywords
Convolutional neural networks, State space methods, Global representation, Learning frameworks, Local projections, Performance, Semantic interactions, Semantic representation, Semantic-aware, State-space models, Visual semantics, Zero-shot learning
Publisher
Association for the Advancement of Artificial Intelligence
