ChatGPT-driven Prompt Generation for Vision-Language Models
Author
Gao, Zhengqing
Department
Machine Learning
Type
Thesis
Date
2023
Language
English
Abstract
As interest grows in large vision-language pre-trained models (e.g., CLIP), researchers have devoted considerable effort to constructing prompts efficiently. However, because of the domain shift that arises in many real-world scenarios, how to effectively adapt vision-language pre-trained models to multiple downstream tasks remains an open problem. One of the most popular ways to address this issue is prompt learning, which keeps the model itself fixed and learns efficient prompts for the images that are fed into the model. Because a single global prompt may be too limited to describe fine-grained features of images, researchers have proposed learning multiple prompts to describe both extrinsic and intrinsic local features. Optimal transport is adopted to prevent the multiple prompts from converging to a single point, by learning a transport plan that minimizes the distance from one distribution to another. Furthermore, visual prompt learning has been proposed to learn prompts for visual features. Although prompt learning approaches bridge the gap caused by domain shift, it is still expensive to handle downstream tasks that require fine-grained prompts and manually labeled data. The advent of ChatGPT makes it possible to learn fine-grained prompts without a large amount of labeled data. We take advantage of ChatGPT's strong real-world understanding to explore its effectiveness in adapting vision-language pre-trained models to downstream tasks. We first use ChatGPT to generate textual prompts for datasets and class categories, and then propose to learn multiple visual prompts via optimal transport. Extensive experiments verify the superiority of our approach on few-shot recognition, fine-grained retrieval, and domain generalization.
Citation
Z. Gao, "ChatGPT-driven Prompt Generation for Vision-Language Models", M.S. Thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2023.
