Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis
Hu, Taihang ; Li, Linxuan ; Van de Weijer, Joost ; Gao, Hongcheng ; Khan, Fahad Shahbaz ; Yang, Jian ; Cheng, Ming-Ming ; Wang, Kai ; Wang, Yaxing
Department
Computer Vision
Type
Conference proceeding
Date
2024
Language
English
Abstract
Although text-to-image (T2I) models exhibit remarkable generation capabilities, they frequently fail to accurately bind semantically related objects or attributes in the input prompts, a challenge termed semantic binding. Previous approaches either involve intensive fine-tuning of the entire T2I model or require users or large language models to specify generation layouts, adding complexity. In this paper, we define semantic binding as the task of associating a given object with its attribute, termed attribute binding, or linking it to other related sub-objects, referred to as object binding. We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token. This ensures that the object, its attributes and sub-objects all share the same cross-attention map. Additionally, to address potential confusion among main objects with complex textual prompts, we propose end token substitution as a complementary strategy. To further refine our approach in the initial stages of T2I generation, where layouts are determined, we incorporate two auxiliary losses, an entropy loss and a semantic binding loss, to iteratively update the composite token to improve the generation integrity. We conducted extensive experiments to validate the effectiveness of ToMe, comparing it against various existing methods on the T2I-CompBench and our proposed GPT-4o object binding benchmark. Our method is particularly effective in complex scenarios that involve multiple objects and attributes, which previous methods often fail to address. The code will be publicly available at https://github.com/hutaihang/ToMe
Citation
T. Hu et al., “Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis,” Adv Neural Inf Process Syst, vol. 37, pp. 137646–137672, Dec. 2024, Accessed: Mar. 24, 2025. [Online]. Available: https://github.com/hutaihang/ToMe
Source
Advances in Neural Information Processing Systems (NeurIPS 2024)
Keywords
Classifier-free guidance, Large Language Models (LLMs), Model unlearning, ORPO reinforcement learning, Synthetic data training
Publisher
NEURIPS
