Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning
Feng, Chun-Mei ; Yu, Kai ; Xu, Xinxing ; Khan, Salman ; Goh, Rick Siow Mong ; Zuo, Wangmeng ; Liu, Yong
Department
Computer Vision
Type
Journal article
Date
2025
Language
English
Abstract
Benefiting from image-text contrastive learning, pre-trained vision-language models such as CLIP make it possible to directly use texts as images (TaI) for parameter-efficient fine-tuning (PEFT). While CLIP is capable of making image features similar to the corresponding text features, the modality gap remains a nontrivial issue and limits the image recognition performance of TaI. Using multi-label image recognition (MLR) as an example, we present a novel method, called T2I-PAL, to tackle the modality gap issue when using only text captions for PEFT. The core design of T2I-PAL is to leverage pre-trained text-to-image generation models to generate photo-realistic and diverse images from text captions, thereby reducing the modality gap. To further enhance MLR, T2I-PAL incorporates a class-wise heatmap and learnable prototypes, which aggregate local similarities and make the representation of local visual features more robust and informative for multi-label recognition. For better PEFT, we further combine prompt tuning and adapter learning to enhance classification performance. T2I-PAL offers significant advantages: it eliminates the need for fully semantically annotated training images, thereby reducing the manual annotation workload, and it preserves the intrinsic mode of the CLIP model, allowing for seamless integration with any existing CLIP framework. Extensive experiments on multiple benchmarks, including MS-COCO, VOC2007, and NUS-WIDE, show that T2I-PAL boosts recognition performance by 3.47% on average over the top-ranked state-of-the-art methods.
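The abstract does not spell out how the class-wise heatmap aggregates local similarities, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each image region and each class is represented by a plain feature vector, builds a per-class heatmap from cosine similarities, and pools it with a softmax-weighted sum. The function names, the `temperature` parameter, and the toy vectors are all hypothetical; the learnable prototypes and the prompt and adapter modules are not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def class_heatmaps(local_feats, class_embeds):
    """Return one heatmap per class: its similarity to every local region."""
    return [[cosine(f, c) for f in local_feats] for c in class_embeds]


def aggregate(heatmap, temperature=0.1):
    """Softmax-weighted sum of local similarities (assumed pooling rule).

    Regions that match the class strongly dominate the pooled score.
    """
    peak = max(heatmap)  # subtract the max for numerical stability
    weights = [math.exp((s - peak) / temperature) for s in heatmap]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, heatmap)) / total


def multilabel_scores(local_feats, class_embeds, temperature=0.1):
    """One pooled score per class; several classes can score high at once."""
    heatmaps = class_heatmaps(local_feats, class_embeds)
    return [aggregate(h, temperature) for h in heatmaps]


# Toy example: three image regions and three classes in a 2-D feature space.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
classes = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
scores = multilabel_scores(regions, classes)
print(scores)
```

In this toy input the first two classes each match at least one region and so both receive high scores, while the third matches none. This is the multi-label behaviour that pooling over local regions, rather than over a single global feature, is meant to support.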
Citation
C. M. Feng et al., “Text to Image for Multi-Label Image Recognition with Joint Prompt-Adapter Learning,” IEEE Trans Pattern Anal Mach Intell, 2025, doi: 10.1109/TPAMI.2025.3573852.
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
Adapter, Multi-Label Image Recognition, Prompt Learning, Text to Image
Publisher
IEEE
