PromptEnhancer: A Simple Approach to Enhance Text-to-Image Models via Chain-of-Thought Prompt Rewriting

Kavli Affiliate: Li Xin Li

| First 5 Authors: Linqing Wang, Linqing Wang, , ,

| Summary:

Recent advancements in text-to-image (T2I) diffusion models have demonstrated
remarkable capabilities in generating high-fidelity images. However, these
models often struggle to faithfully render complex user prompts, particularly
in aspects like attribute binding, negation, and compositional relationships.
This leads to a significant mismatch between user intent and the generated
output. To address this challenge, we introduce PromptEnhancer, a novel and
universal prompt rewriting framework that enhances any pretrained T2I model
without requiring modifications to its weights. Unlike prior methods that rely
on model-specific fine-tuning or implicit reward signals like image-reward
scores, our framework decouples the rewriter from the generator. We achieve
this by training a Chain-of-Thought (CoT) rewriter through reinforcement
learning, guided by a dedicated reward model we term the AlignEvaluator. The
AlignEvaluator is trained to provide explicit and fine-grained feedback based
on a systematic taxonomy of 24 key points, which are derived from a
comprehensive analysis of common T2I failure modes. By optimizing the CoT
rewriter to maximize the reward from our AlignEvaluator, our framework learns
to generate prompts that are more precisely interpreted by T2I models.
Extensive experiments on the HunyuanImage 2.1 model demonstrate that
PromptEnhancer significantly improves image-text alignment across a wide range
of semantic and compositional challenges. Furthermore, we introduce a new,
high-quality human preference benchmark to facilitate future research in this
direction.

| Search Query: ArXiv Query: search_query=au:”Li Xin Li”&id_list=&start=0&max_results=3