Text-to-Image Diffusion with Complex and Detailed Prompts

Gani, Mohammad
Department
Machine Learning
Embargo End Date
2024-01-01
Type
Thesis
Date
2024
Language
English
Abstract
Diffusion-based generative models have risen to prominence, delivering a massive leap forward in computer vision and emerging as a key tool for unlocking the vast potential of generative AI. Among their many applications, their ability to generate high-quality images from textual prompts is remarkable. This intersection of text and image synthesis has the potential to revolutionize content creation, design, and other domains where conveying complex visual concepts from textual descriptions is paramount. However, despite their remarkable achievements, diffusion-based generative models encounter notable hurdles when tasked with processing lengthy and intricate textual prompts. These challenges become particularly pronounced when describing scenes with multiple objects, intricate attributes, and nuanced contextual details. While these models excel at faithfully generating images from succinct, single-object descriptions, they often struggle to capture the richness and complexity inherent in longer textual inputs. This limitation poses a significant barrier, hindering their ability to accurately translate intricate textual descriptions into coherent visual representations. To mitigate these issues, we present a novel training-free approach that leverages Large Language Models (LLMs) to extract critical components from text prompts, including bounding box coordinates for foreground objects, detailed textual descriptions for individual objects, and a succinct background context. These components form the foundation of our layout-to-image generation model, which operates in two phases. The initial Global Scene Generation phase utilizes object layouts and background context to create an initial scene, but it often falls short in faithfully representing object characteristics as specified in the prompts.
To address this limitation, we introduce an Iterative Refinement Scheme that iteratively evaluates and refines box-level content, aligning it with the corresponding textual descriptions and recomposing objects as needed to ensure consistency. Our evaluation on complex prompts featuring multiple objects demonstrates a substantial improvement in recall compared to baseline diffusion models. A user study further validates this result, underscoring the efficacy of our approach in generating coherent and detailed scenes from intricate textual inputs. Our iterative framework offers a promising solution for enhancing text-to-image generation models' fidelity to lengthy, multifaceted descriptions, opening new possibilities for accurate and diverse image synthesis from textual inputs.
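The two-phase pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the thesis implementation: the JSON layout schema, function names, the alignment threshold, and all stubbed components (the scene generator and the box-level alignment scorer) are assumptions introduced here for clarity.

```python
import json

def parse_layout(llm_response: str) -> dict:
    """Parse the layout components extracted by the LLM.

    Assumes a hypothetical JSON schema: a "background" string plus a list of
    "objects", each with a normalized [x0, y0, x1, y1] "box" and a textual
    "description".
    """
    layout = json.loads(llm_response)
    for obj in layout["objects"]:
        x0, y0, x1, y1 = obj["box"]
        # Sanity-check that each box lies inside the unit square.
        assert 0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0
    return layout

def generate_scene(layout: dict, seed: int = 0) -> dict:
    """Stub for the Global Scene Generation phase (layout-to-image)."""
    return {"layout": layout, "seed": seed}  # placeholder for a real image

def box_alignment_score(image: dict, obj: dict) -> float:
    """Stub scorer for how well a box's content matches its description.

    A real system might crop the box region and score it with an
    image-text model; here we simply return a perfect score.
    """
    return 1.0

def iterative_refinement(layout: dict, max_iters: int = 3,
                         threshold: float = 0.8) -> dict:
    """Evaluate box-level content and recompose failing boxes until aligned."""
    image = generate_scene(layout)
    for _ in range(max_iters):
        failing = [o for o in layout["objects"]
                   if box_alignment_score(image, o) < threshold]
        if not failing:
            break  # every box matches its textual description
        for obj in failing:
            # Recompose only the under-aligned region (stubbed as a reseed).
            image = generate_scene(layout, seed=hash(obj["description"]) % 1000)
    return image
```

For example, a decomposed prompt such as `{"background": "a sunny park", "objects": [{"box": [0.1, 0.2, 0.4, 0.9], "description": "a red bicycle"}]}` would be parsed, rendered once globally, and then refined box by box until each region passes the alignment check.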
Citation
M. Gani, "Text-to-Image Diffusion with Complex and Detailed Prompts," M.S. thesis, Machine Learning, MBZUAI, Abu Dhabi, UAE, 2024.