Kavli Affiliate: Jing Wang
| First 5 Authors: Jiasong Feng, , ,
| Summary:
Synthesizing motion-rich and temporally consistent videos remains a challenge
in artificial intelligence, especially when dealing with extended durations.
Existing text-to-video (T2V) models commonly employ spatial cross-attention for
text control, applying the same textual guidance to every frame rather than
frame-specific guidance. This restricts the model's capacity to comprehend the
temporal logic conveyed in prompts and to generate videos with coherent motion.
To tackle this limitation, we introduce FancyVideo, an innovative
video generator that improves the existing text-control mechanism with the
well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM
incorporates the Temporal Information Injector (TII) and Temporal Affinity
Refiner (TAR) at the beginning and end of cross-attention, respectively, to
achieve frame-specific textual guidance. First, TII injects frame-specific
information from the latent features into the text conditions, yielding
cross-frame textual conditions. Then, TAR refines the correlation matrix
between cross-frame textual conditions and latent features along the time
dimension. Extensive experiments comprising both quantitative and qualitative
evaluations demonstrate the effectiveness of FancyVideo. Our approach achieves
state-of-the-art T2V generation results on the EvalCrafter benchmark and
facilitates the synthesis of dynamic and consistent videos. Note that
FancyVideo's T2V process essentially comprises a text-to-image step followed by
text-and-image-to-video (T+I2V) generation, so it also supports generating
videos from user-provided images, i.e., the image-to-video (I2V) task.
Extensive experiments show that its performance on this task is also
outstanding.
| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3
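
Below is a minimal, single-head PyTorch sketch of how CTGM's two components
might operate, based only on the roles described in the summary above: TII
injects frame-specific latent information into the text conditions, and TAR
refines the text-latent correlation matrix along the time dimension. All
tensor shapes, internals (the residual injection in TII, the temporal
convolution in TAR), and hyperparameters are illustrative assumptions, not the
paper's actual implementation.

import torch
import torch.nn as nn


class CTGM(nn.Module):
    """Sketch of a Cross-frame Textual Guidance Module (single head)."""

    def __init__(self, dim: int):
        super().__init__()
        # TII: text tokens attend to each frame's latent tokens, producing
        # frame-specific ("cross-frame") textual conditions.
        self.tii_q = nn.Linear(dim, dim)
        self.tii_kv = nn.Linear(dim, 2 * dim)
        # Main cross-attention projections (latent features query the text).
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, 2 * dim)
        # TAR: a 1D conv over the frame axis, one plausible way to "refine the
        # correlation matrix along the time dimension" (an assumption here).
        self.tar = nn.Conv1d(1, 1, kernel_size=3, padding=1)
        self.scale = dim ** -0.5

    def forward(self, z: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # z:    (B, Fr, N, C) latent video features (Fr frames, N tokens each)
        # text: (B, L, C)     prompt token embeddings (L tokens)
        B, Fr, N, C = z.shape
        L = text.shape[1]

        # --- TII: build cross-frame textual conditions (B, Fr, L, C) ---
        q = self.tii_q(text).unsqueeze(1)                 # (B, 1, L, C)
        k, v = self.tii_kv(z).chunk(2, dim=-1)            # each (B, Fr, N, C)
        attn = (q @ k.transpose(-2, -1)) * self.scale     # (B, Fr, L, N)
        text_f = attn.softmax(-1) @ v + text.unsqueeze(1) # residual injection

        # --- Cross-attention with TAR refinement ---
        qz = self.to_q(z)                                 # (B, Fr, N, C)
        kt, vt = self.to_kv(text_f).chunk(2, dim=-1)      # (B, Fr, L, C)
        logits = (qz @ kt.transpose(-2, -1)) * self.scale # (B, Fr, N, L)

        # TAR: refine each text-latent correlation across the frame axis.
        flat = logits.permute(0, 2, 3, 1).reshape(-1, 1, Fr)  # (B*N*L, 1, Fr)
        flat = flat + self.tar(flat)                          # temporal refine
        logits = flat.reshape(B, N, L, Fr).permute(0, 3, 1, 2)

        return logits.softmax(-1) @ vt                    # (B, Fr, N, C)


if __name__ == "__main__":
    # Toy check: 2 videos, 8 frames, 16 latent tokens, 77 text tokens, dim 64.
    ctgm = CTGM(dim=64)
    out = ctgm(torch.randn(2, 8, 16, 64), torch.randn(2, 77, 64))
    print(out.shape)  # torch.Size([2, 8, 16, 64])

The key structural point the sketch tries to capture is that the text keys and
values become per-frame tensors before the main cross-attention, so each frame
receives its own textual guidance rather than a shared one.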