MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training

Kavli Affiliate: Zheng Zhu

| First 5 Authors: Haoyun Li

| Summary:

Vision Language Action (VLA) models derive their generalization capability
from diverse training data, yet collecting embodied robot interaction data
remains prohibitively expensive. In contrast, human demonstration videos are
far more scalable and cost-efficient to collect, and recent studies confirm
their effectiveness in training VLA models. However, a significant domain gap
persists between human videos and robot-executed videos, including unstable
camera viewpoints, visual discrepancies between human hands and robotic arms,
and differences in motion dynamics. To bridge this gap, we propose
MimicDreamer, a framework that turns fast, low-cost human demonstrations into
robot-usable supervision by jointly aligning vision, viewpoint, and actions to
directly support policy training. For visual alignment, we propose H2R Aligner,
a video diffusion model that generates high-fidelity robot demonstration videos
by transferring motion from human manipulation footage. For viewpoint
stabilization, we propose EgoStabilizer, which canonicalizes egocentric videos
via homography and inpaints the occlusions and distortions introduced by warping. For
action alignment, we map human hand trajectories to the robot frame and apply a
constrained inverse kinematics solver to produce feasible, low-jitter joint
commands with accurate pose tracking. Empirically, VLA models trained purely on
our synthesized human-to-robot videos achieve few-shot execution on real
robots. Moreover, scaling training with human data significantly boosts
performance compared to models trained solely on real robot data; our approach
improves the average success rate by 14.7% across six representative
manipulation tasks.
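To make the viewpoint-stabilization idea concrete, the following is a minimal sketch of homography-based canonicalization with border inpainting using OpenCV. It is an illustration under assumptions, not the paper's EgoStabilizer implementation: the function name `stabilize_frame` and the matched keypoint inputs (`ref_kpts`, `frame_kpts`) are hypothetical, and how correspondences to the canonical view are obtained is left unspecified here.

```python
import cv2
import numpy as np

def stabilize_frame(frame, ref_kpts, frame_kpts, out_size):
    """Warp one egocentric frame into a canonical viewpoint and fill warp holes.

    frame      : HxWx3 uint8 image from the egocentric video
    ref_kpts   : Nx2 float array of keypoints in the canonical (reference) view
    frame_kpts : Nx2 float array of the matching keypoints in this frame
    out_size   : (width, height) of the canonical output image
    """
    # Estimate a homography mapping this frame's keypoints onto the
    # reference view; RANSAC rejects outlier correspondences.
    H, _ = cv2.findHomography(frame_kpts, ref_kpts, cv2.RANSAC, 5.0)

    # Warp the frame into the canonical viewpoint.
    warped = cv2.warpPerspective(frame, H, out_size)

    # Warping leaves regions with no source pixels (black borders).
    # Warp a validity mask the same way to locate those holes.
    valid = cv2.warpPerspective(
        np.full(frame.shape[:2], 255, np.uint8), H, out_size,
        flags=cv2.INTER_NEAREST)
    holes = np.where(valid < 128, 255, 0).astype(np.uint8)

    # Inpaint the holes so downstream models see a complete canonical frame.
    return cv2.inpaint(warped, holes, 3, cv2.INPAINT_TELEA)
```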
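The action-alignment step (mapping hand trajectories into the robot frame, then solving a constrained inverse kinematics problem for feasible, low-jitter joint commands) can likewise be sketched generically. This uses a standard damped-least-squares update with joint limits and a per-step motion cap, assumed here for illustration; `T_base_cam`, `fk`, and `jacobian` are hypothetical placeholders for the camera-to-base transform and the robot's kinematics, and the paper's constrained solver may differ.

```python
import numpy as np

def hand_to_robot(p_hand_cam, T_base_cam):
    """Map a 3D hand keypoint from the camera frame into the robot base frame.

    p_hand_cam : length-3 position in camera coordinates
    T_base_cam : 4x4 homogeneous transform from camera frame to robot base frame
    """
    p_h = np.append(p_hand_cam, 1.0)          # homogeneous coordinates
    return (T_base_cam @ p_h)[:3]

def constrained_ik_step(q, target, fk, jacobian, q_min, q_max,
                        damping=1e-2, max_dq=0.05):
    """One damped-least-squares IK update with joint limits and a step cap.

    q        : current joint configuration (length-n array)
    target   : desired end-effector position in the robot base frame
    fk       : callable q -> end-effector position (3,)
    jacobian : callable q -> 3xn positional Jacobian
    """
    err = target - fk(q)                      # Cartesian position error
    J = jacobian(q)
    # Damped least squares: dq = J^T (J J^T + lambda^2 I)^-1 err
    dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(3), err)
    dq = np.clip(dq, -max_dq, max_dq)         # cap per-step motion to limit jitter
    return np.clip(q + dq, q_min, q_max)      # enforce joint limits
```

Capping the per-step joint update is one simple way to approach the low-jitter commands the abstract mentions; a full solver would also track end-effector orientation and could add an explicit smoothness objective across the trajectory.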

| Search Query: ArXiv Query: search_query=au:"Zheng Zhu"&id_list=&start=0&max_results=3
