Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models

Jiang, Longtao
Huang, Jie
Han, Mingfei
Chen, Lei
Yu, Yongqiang
Zhao, Feng
Chang, Xiaojun
Li, Zhihui
Department
Computer Vision
Type
Conference proceeding
Abstract
Text-guided image inpainting aims to fill masked image regions according to a textual prompt while preserving the background. Although diffusion-based methods have become dominant, they model the entire image in latent space, which makes it difficult for their results to align closely with prompt details and to keep the background consistent. To address these issues, we explore Mask AutoRegressive (MAR) models for this task. MAR naturally supports image inpainting by generating only the latent tokens that correspond to masked regions, which enables better local controllability without altering the background. However, applying MAR to this task directly causes the inpainted content either to ignore the prompt or to clash with the background context. By analyzing the attention maps of inpainted images, we identify how background tokens affect text tokens during MAR generation, and we leverage this finding to design Token Painter, a training-free text-guided image inpainting method based on MAR. Our approach introduces two key components: (1) Dual-Stream Encoder Information Fusion (DEIF), which fuses semantic information from the text and context information from the background in the frequency domain to produce novel guidance tokens, allowing MAR to generate text-faithful inpainting content that remains harmonious with the background context; and (2) Adaptive Decoder Attention Score Enhancing (ADAE), which adaptively enhances attention scores on guidance tokens and inpainting tokens to further improve alignment with prompt details and the visual quality of the generated content. Extensive experiments demonstrate that our training-free method outperforms prior state-of-the-art methods on almost all metrics.
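Two ideas in the abstract are concrete enough to illustrate: generating only the tokens of the masked region while leaving background tokens untouched, and raising attention scores on selected tokens. The plain-Python sketch below shows both under stated assumptions. It is a toy illustration, not the Token Painter implementation. The function names are hypothetical, tokens are plain lists, and the attention boost uses a fixed, hypothetical `scale` (added as `log(scale)` to the selected keys' pre-softmax scores), whereas the paper describes ADAE's enhancement as adaptive.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of logits.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def boosted_attention_weights(query, keys, boost_idx, scale=2.0):
    """Scaled dot-product attention weights for one query.

    Hypothetical helper: keys whose index is in `boost_idx` get log(scale)
    added to their pre-softmax score, which multiplies their unnormalized
    weight by `scale`. A fixed scale is an assumption; ADAE is adaptive.
    """
    d = len(query)
    boost = set(boost_idx)
    logits = []
    for j, key in enumerate(keys):
        s = sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
        if j in boost:
            s += math.log(scale)
        logits.append(s)
    return softmax(logits)


def attend(weights, values):
    # Weighted sum of value vectors, one output channel at a time.
    dim = len(values[0])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]


def compose_tokens(original, generated, mask):
    """Keep background tokens as-is; take generated tokens only where mask is True."""
    return [g if m else o for o, g, m in zip(original, generated, mask)]
```

Because the boost is additive in log space, it multiplies the boosted keys' unnormalized weights by `scale` and leaves the ratios among non-boosted keys unchanged. The sketch does not model how MAR schedules which masked tokens are generated at each step, nor the frequency-domain fusion used by DEIF.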
Citation
L. Jiang, J. Huang, M. Han, L. Chen, Y. Yu, F. Zhao, et al., "Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models," 2026, pp. 5397-5405.
Source
Proceedings of the AAAI Conference on Artificial Intelligence
Conference
The Thirty-Ninth AAAI Conference on Artificial Intelligence
Keywords
46 Information and Computing Sciences, 4603 Computer Vision and Multimedia Computation
Publisher
Association for the Advancement of Artificial Intelligence