Frequency-Based Comprehensive Prompt Learning for Vision-Language Models
Liu, Liangchen ; Wang, Nannan ; Chen, Chen ; Liu, Decheng ; Yang, Xi ; Gao, Xinbo ; Liu, Tongliang
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
This paper aims to learn multiple comprehensive text prompts that describe visual concepts from coarse to fine, thereby giving pre-trained vision-language models (VLMs) better transfer ability on various downstream tasks. We explore this idea on transformer-based VLMs because this architecture achieves stronger performance than CNN-based ones. Unfortunately, unlike CNNs, the transformer-based visual encoder of a pre-trained VLM does not naturally provide discriminative and representative local visual information. To solve this problem, we propose Frequency-based Comprehensive Prompt Learning (FCPrompt), which extracts representative local visual information from the redundant output features of the visual encoder. FCPrompt transforms these features into the frequency domain via the Discrete Cosine Transform (DCT). Taking advantage of the energy concentration and information orthogonality of the DCT, we obtain compact, informative, and disentangled local visual information by leveraging specific frequency components of the transformed features. To better fit transformer architectures, FCPrompt further adopts a dual-branch framework that optimizes different text prompts to align with the global and the frequency-based local visual information, respectively. The learned text prompts can thus comprehensively describe visual concepts from coarse to fine. Extensive experiments indicate that FCPrompt achieves state-of-the-art performance on various benchmarks. Code is available at https://github.com/llcllc1997/FCPrompt.
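The two DCT properties the abstract relies on, energy concentration and orthogonality, can be illustrated with a minimal standard-library sketch. This is not the authors' implementation. The abstract does not say along which axis the DCT is applied or which frequency components are kept, so the 1-D transform over the token axis, the choice of the lowest frequencies, and all names below (`dct_matrix`, `dct_along_tokens`, `energy`, the toy `tokens`) are assumptions made for illustration only.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis: row k is the k-th cosine basis vector of length n."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n)])
    return rows

def dct_along_tokens(tokens):
    """Apply the DCT across the token axis (an assumed axis, see lead-in).

    tokens: list of n token features, each a list of d floats.
    Returns n frequency components, each a list of d floats.
    """
    n, d = len(tokens), len(tokens[0])
    basis = dct_matrix(n)
    return [
        [sum(basis[k][i] * tokens[i][c] for i in range(n)) for c in range(d)]
        for k in range(n)
    ]

def energy(rows):
    """Sum of squared entries."""
    return sum(v * v for row in rows for v in row)

# Hypothetical toy input: 8 tokens with 2 channels that vary smoothly,
# standing in for redundant visual-encoder outputs.
tokens = [[1.0 + 0.1 * i, 2.0 - 0.05 * i] for i in range(8)]

freq = dct_along_tokens(tokens)
kept = freq[:2]  # assumption: keep the two lowest-frequency components

# An orthonormal transform preserves total energy, so this ratio is the
# share of the original feature energy held by the kept components.
print("energy kept by 2 of 8 components:", energy(kept) / energy(freq))
```

Because the basis rows are orthonormal, each retained component carries information that does not overlap with the others, and for smooth, redundant inputs most of the energy falls into a few coefficients. That is the sense in which a small set of frequency components can be compact and disentangled. Which components FCPrompt actually selects is specified in the paper and its repository, not here.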
Citation
L. Liu et al., "Frequency-Based Comprehensive Prompt Learning for Vision-Language Models," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2025.3599830
Source
IEEE Transactions on Pattern Analysis and Machine Intelligence
Keywords
Parameter-Efficient Fine-Tuning, Prompt Learning, Transfer Learning, Vision-Language Model
Publisher
IEEE
