
Chain-of-Focus Prompting: Leveraging Sequential Visual Cues to Prompt Large Autoregressive Vision Models

Zheng, Jiyang
Shen, Jialiang
Yao, Yu
Wang, Min
Yang, Yang
Wang, Dadong
Liu, Tongliang
Department
Machine Learning
Type
Poster
Date
2025
Language
English
Abstract
In-context learning (ICL) has revolutionized natural language processing by enabling models to adapt to diverse tasks from only a few illustrative examples. However, ICL remains underexplored in computer vision. Inspired by Chain-of-Thought (CoT) prompting in the language domain, we propose Chain-of-Focus (CoF) Prompting, which enhances vision models by enabling step-by-step visual comprehension. CoF Prompting addresses the lack of explicit logical structure in visual data by generating intermediate reasoning steps through visual saliency. It also provides a way to build tailored prompts from visual inputs by selecting contextually informative prompts based on query similarity and target richness. CoF Prompting is especially relevant given the recent introduction of Large Autoregressive Vision Models (LAVMs), which predict downstream targets via in-context learning from purely visual inputs. By integrating intermediate reasoning steps into visual prompts and selecting the most informative ones, LAVMs generate significantly better inferences. Extensive experiments on downstream visual understanding tasks validate the effectiveness of the proposed method for visual in-context learning.
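The abstract does not say how query similarity and target richness are measured or combined. As a purely illustrative sketch, and not the authors' implementation, the selection step could be pictured as ranking candidate prompts by a weighted sum of the two signals. Everything here is an assumption made for illustration: the names `select_prompts` and `alpha`, the use of cosine similarity over feature vectors, and a richness score in [0, 1].

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def select_prompts(query_feat, candidates, k=2, alpha=0.5):
    """Return the names of the k highest-scoring candidate prompts.

    candidates: list of (name, feature_vector, richness), richness in [0, 1].
    alpha: hypothetical trade-off weight between similarity and richness;
           it is an assumption of this sketch, not a value from the paper.
    """
    scored = []
    for name, feat, richness in candidates:
        similarity = cosine_similarity(query_feat, feat)
        score = alpha * similarity + (1 - alpha) * richness
        scored.append((score, name))
    # Highest combined score first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored[:k]]

# Toy data: 2-D features stand in for image embeddings.
query = [1.0, 0.0]
pool = [
    ("a", [1.0, 0.0], 0.2),  # same direction as the query, sparse target
    ("b", [0.0, 1.0], 0.9),  # orthogonal to the query, rich target
    ("c", [1.0, 1.0], 0.8),  # partially similar, rich target
]
selected = select_prompts(query, pool, k=2, alpha=0.5)
```

In this toy pool, a candidate that is only partially similar to the query can still rank first when its target is rich, which is the kind of trade-off a combined criterion allows. The actual features, richness measure, and weighting used in the paper may differ.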
Citation
J. Zheng, J. Shen, Y. Yao, M. Wang, Y. Yang, D. Wang, and T. Liu, “Chain-of-Focus Prompting: Leveraging Sequential Visual Cues to Prompt Large Autoregressive Vision Models,” in Proc. Int. Conf. Learn. Representations (ICLR), 2025.
Source
International Conference on Learning Representations (ICLR)
Keywords
Computer Vision, Visual In-context Learning
Publisher
ICLR