Kavli Affiliate: Zheng Zhu | First 5 Authors: Haoyun Li | Summary: Vision Language Action (VLA) models derive their generalization capability from diverse training data, yet collecting embodied robot interaction data remains prohibitively expensive. In contrast, human demonstration videos are far more scalable and cost-efficient to collect, and recent studies confirm […]
Continue reading: MimicDreamer: Aligning Human and Robot Demonstrations for Scalable VLA Training