Multi-modal Prompts with Primitives Enhancement for Compositional Zero-Shot Learning
Jin, Yutang ; Chen, Shiming ; Tong, Tianle ; Ding, Weiping ; Wang, Yisong
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Compositional zero-shot learning (CZSL) aims to recognize novel compositions of known attributes and objects without requiring additional training data. Recent CZSL methods based on vision-language models (e.g., CLIP) suffer from relying solely on text prompts and neglecting the crucial primitive features within compositions, which limits generalization to unseen compositions. To overcome these limitations, we propose a Multi-modal Prompt and Primitives Enhancement method, termed MPPE, which incorporates two key aspects. First, MPPE introduces both text and visual prompts. The text prompts consist of the composition and its corresponding attribute and object prompts, while the visual prompts leverage image masks generated by the segment anything model (SAM). These masks are integrated via an additional Alpha branch to strengthen the CLIP visual encoder to focus on regions of interest within the image. Second, we design a primitives enhancement (PE) module based on cross-attention, which refines attribute and object features obtained from the CLIP text encoder, thereby enriching the representation of novel composition features. Extensive experiments demonstrate the effectiveness of our approach, achieving state-of-the-art performance on three widely-used CZSL benchmarks in both closed-world and open-world CZSL scenarios. Codes are available at https://github.com/YtJin-git/MPPE. © 1991-2012 IEEE
Citation
Y. Jin, S. Chen, T. Tong, W. Ding and Y. Wang, "Multi-modal Prompts with Primitives Enhancement for Compositional Zero-Shot Learning," in IEEE Transactions on Circuits and Systems for Video Technology, doi: 10.1109/TCSVT.2025.3577246
Source
IEEE Transactions on Circuits and Systems for Video Technology
Keywords
Compositional Zero-Shot Learning, Primitives Enhancement, Prompt Learning, Transformer, Vision-Language Models
Publisher
IEEE
