HumanDreamer: Generating Controllable Human-Motion Videos via Decoupled Generation

Kavli Affiliate: Zheng Zhu

| First 5 Authors: Boyuan Wang, Xiaofeng Wang, Chaojun Ni, Guosheng Zhao, Zhiqin Yang

| Summary:

Human-motion video generation has been a challenging task, primarily due to
the difficulty inherent in learning human body movements. While some approaches
have attempted to drive human-centric video generation explicitly through pose
control, these methods typically rely on poses derived from existing videos,
thereby lacking flexibility. To address this, we propose HumanDreamer, a
decoupled human video generation framework that first generates diverse poses
from text prompts and then leverages these poses to generate human-motion
videos. Specifically, we propose MotionVid, the largest dataset for
human-motion pose generation. Based on this dataset, we present MotionDiT,
which is trained to generate structured human-motion poses from text prompts.
In addition, we introduce a novel LAMA loss; together, these contributions
improve FID by 62.4% and yield gains in R-precision of 41.8%, 26.3%, and 18.3%
for Top-1, Top-2, and Top-3, respectively, advancing both Text-to-Pose control
accuracy and FID. Our
experiments across various Pose-to-Video baselines demonstrate that the poses
generated by our method can produce diverse and high-quality human-motion
videos. Furthermore, our model can facilitate other downstream tasks, such as
pose sequence prediction and 2D-3D motion lifting.
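The decoupled design can be pictured as two independent stages: a text-to-pose
generator produces a pose sequence, which then conditions a separate
pose-to-video model. The sketch below is only an illustration of that data flow;
the class names, joint count, and tensor shapes are assumptions for the example,
not the paper's released implementation.

```python
# Minimal sketch of the decoupled pipeline (text -> pose -> video).
# All module names, shapes, and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class TextToPoseModel(nn.Module):
    """Stand-in for a MotionDiT-style text-to-pose generator."""
    def __init__(self, text_dim=512, num_joints=17, num_frames=64):
        super().__init__()
        self.num_joints, self.num_frames = num_joints, num_frames
        # A single linear head stands in for the full diffusion transformer.
        self.head = nn.Linear(text_dim, num_frames * num_joints * 2)

    def forward(self, text_emb):                  # (B, text_dim)
        out = self.head(text_emb)
        # Pose sequence: (B, frames, joints, 2) in normalized coordinates.
        return out.view(-1, self.num_frames, self.num_joints, 2)

class PoseToVideoModel(nn.Module):
    """Stand-in for any Pose-to-Video baseline conditioned on generated poses."""
    def __init__(self, num_joints=17, height=64, width=64):
        super().__init__()
        self.height, self.width = height, width
        self.head = nn.Linear(num_joints * 2, height * width * 3)

    def forward(self, poses):                     # (B, frames, joints, 2)
        b, t = poses.shape[:2]
        frames = self.head(poses.flatten(2))      # (B, frames, H*W*3)
        return frames.view(b, t, 3, self.height, self.width)

if __name__ == "__main__":
    text_emb = torch.randn(1, 512)                # placeholder text embedding
    poses = TextToPoseModel()(text_emb)           # stage 1: text -> pose
    video = PoseToVideoModel()(poses)             # stage 2: pose -> video
    print(poses.shape, video.shape)
```

Because the two stages only communicate through the pose sequence, the same
generated poses could in principle be reused with different Pose-to-Video
backbones, which is what the cross-baseline experiments in the summary refer to.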

| Search Query: ArXiv Query: search_query=au:"Zheng Zhu"&id_list=&start=0&max_results=3
