FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

Kavli Affiliate: Jing Wang

| First 5 Authors: Jiasong Feng, Ao Ma, Jing Wang, Bo Cheng, Xiaodan Liang

| Summary:

Synthesizing motion-rich and temporally consistent videos remains a challenge
in artificial intelligence, especially when dealing with extended durations.
Existing text-to-video (T2V) models commonly employ spatial cross-attention for
text control, equivalently guiding different frame generations without
frame-specific textual guidance. Thus, the model’s capacity to comprehend the
temporal logic conveyed in prompts and generate videos with coherent motion is
restricted. To tackle this limitation, we introduce FancyVideo, an innovative
video generator that improves the existing text-control mechanism with the
well-designed Cross-frame Textual Guidance Module (CTGM). Specifically, CTGM
incorporates the Temporal Information Injector (TII), Temporal Affinity Refiner
(TAR), and Temporal Feature Booster (TFB) at the beginning, middle, and end of
cross-attention, respectively, to achieve frame-specific textual guidance.
Firstly, TII injects frame-specific information from latent features into text
conditions, thereby obtaining cross-frame textual conditions. Then, TAR refines
the correlation matrix between cross-frame textual conditions and latent
features along the time dimension. Lastly, TFB boosts the temporal consistency
of latent features. Extensive experiments comprising both quantitative and
qualitative evaluations demonstrate the effectiveness of FancyVideo. Our video
demo, code and model are available at https://360cvgroup.github.io/FancyVideo/.

| Search Query: ArXiv Query: search_query=au:”Jing Wang”&id_list=&start=0&max_results=3