Kavli Affiliate: Jing Wang
| First 5 Authors: Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng
| Summary:
“Creative” remains an inherently abstract concept for both humans and
diffusion models. While text-to-image (T2I) diffusion models can easily
generate out-of-domain concepts like “a blue banana”, they struggle with
generating combinatorial objects such as “a creative mixture that resembles a
lettuce and a mantis”, due to difficulties in understanding the semantic depth
of “creative”. Current methods rely heavily on synthesizing reference prompts
or images to achieve a creative effect, typically requiring retraining for each
unique creative output — a process that is computationally intensive and
limits practical applications. To address this, we introduce CreTok, which
brings meta-creativity to diffusion models by redefining “creative” as a new
token, <CreTok>, thus enhancing models’ semantic understanding of
combinatorial creativity. CreTok achieves this redefinition by iteratively
sampling diverse text pairs from our proposed CangJie dataset to form adaptive
prompts and restrictive prompts, and then optimizing the similarity between
their respective text embeddings. Extensive experiments demonstrate that
<CreTok> enables the universal and direct generation of combinatorial
creativity across diverse concepts without additional training (4 s per image
vs. BASS’s 2400 s), achieving state-of-the-art performance with improved
text-image alignment (↑0.03 in VQAScore) and higher human preference
ratings (↑0.009 in PickScore and ↑0.169 in ImageReward).
Further evaluations with GPT-4o and user studies underscore CreTok’s strengths
in advancing creative generation.
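The redefinition loop the summary describes (sample a text pair, build an adaptive prompt containing the learnable token and a restrictive prompt that spells the concepts out, then optimize the similarity between their embeddings) can be sketched with a toy setup. Everything below is an illustrative assumption, not the paper's implementation: the bag-of-tokens mean-pool "encoder", the prompt templates, and the numerical-gradient ascent merely stand in for the frozen text encoder and training loop a real system would use.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8

# Toy frozen vocabulary: fixed random embeddings per token (assumption).
VOCAB = {w: rng.normal(size=DIM)
         for w in ["a", "of", "mixture", "lettuce", "mantis"]}

def encode(tokens, cretok=None):
    """Mean-pool token embeddings; '<CreTok>' resolves to the learnable vector."""
    vecs = [cretok if t == "<CreTok>" else VOCAB[t] for t in tokens]
    return np.mean(vecs, axis=0)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Adaptive prompt uses the learnable token; the restrictive prompt
# names the concept pair explicitly (hypothetical templates).
adaptive = ["a", "<CreTok>", "mixture"]
restrictive = ["a", "mixture", "of", "a", "lettuce", "a", "mantis"]

cretok = rng.normal(size=DIM)          # learnable <CreTok> embedding
target = encode(restrictive)           # frozen restrictive-prompt embedding

before = cosine(encode(adaptive, cretok), target)
lr, eps = 0.5, 1e-5
for _ in range(200):
    # Numerical gradient of cosine similarity w.r.t. the token vector.
    base = cosine(encode(adaptive, cretok), target)
    grad = np.zeros(DIM)
    for i in range(DIM):
        d = np.zeros(DIM)
        d[i] = eps
        grad[i] = (cosine(encode(adaptive, cretok + d), target) - base) / eps
    cretok += lr * grad                # gradient ascent on similarity

after = cosine(encode(adaptive, cretok), target)
print(f"cosine before: {before:.3f}, after: {after:.3f}")
```

A full system would instead backpropagate through a frozen T2I text encoder while updating only the token embedding, so one learned <CreTok> transfers across concept pairs; the toy loop above only shows the objective being optimized.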
| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3