GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Kavli Affiliate: Jing Wang

| First 5 Authors: NVIDIA, Johan Bjorck, Fernando Castañeda, Nikita Cherniadev

| Summary:

General-purpose robots need a versatile body and an intelligent mind. Recent
advancements in humanoid robots have shown great promise as a hardware platform
for building generalist autonomy in the human world. A robot foundation model,
trained on massive and diverse data sources, is essential for enabling the
robots to reason about novel situations, robustly handle real-world
variability, and rapidly learn new tasks. To this end, we introduce GR00T N1,
an open foundation model for humanoid robots. GR00T N1 is a
Vision-Language-Action (VLA) model with a dual-system architecture. The
vision-language module (System 2) interprets the environment through vision and
language instructions. The subsequent diffusion transformer module (System 1)
generates fluid motor actions in real time. Both modules are tightly coupled
and jointly trained end-to-end. We train GR00T N1 with a heterogeneous mixture
of real-robot trajectories, human videos, and synthetically generated datasets.
We show that our generalist robot model GR00T N1 outperforms the
state-of-the-art imitation learning baselines on standard simulation benchmarks
across multiple robot embodiments. Furthermore, we deploy our model on the
Fourier GR-1 humanoid robot for language-conditioned bimanual manipulation
tasks, achieving strong performance with high data efficiency.
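
The dual-system flow described in the summary can be illustrated with a toy sketch: a "System 2" step turns an observation and an instruction into a context vector, and a "System 1" step refines a noisy chunk of actions conditioned on that context. Everything here is an assumption for illustration only: the function names `system2_encode` and `system1_denoise`, the feature sizes, and the fixed-rate interpolation used in place of a learned denoiser are all hypothetical and are not taken from GR00T N1, which uses a vision-language model and a diffusion transformer.

```python
import random


def system2_encode(image_features, instruction):
    """Toy stand-in for a vision-language module (hypothetical).

    Blends image features with a crude signal derived from the
    instruction text and returns a context vector.
    """
    text_signal = (len(instruction) % 10) / 10.0
    return [0.5 * f + 0.5 * text_signal for f in image_features]


def system1_denoise(context, horizon=4, steps=10, seed=0):
    """Toy stand-in for an action generator (hypothetical).

    Starts from Gaussian noise and iteratively refines a chunk of
    `horizon` actions. Each step moves every action halfway toward the
    context vector; a real model would use a learned denoising network.
    """
    rng = random.Random(seed)
    action_dim = len(context)
    actions = [
        [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
        for _ in range(horizon)
    ]
    for _ in range(steps):
        actions = [
            [a + 0.5 * (c - a) for a, c in zip(row, context)]
            for row in actions
        ]
    return actions


if __name__ == "__main__":
    context = system2_encode([0.2, 0.4, 0.6], "pick up the cup")
    action_chunk = system1_denoise(context)
    print(len(action_chunk), "actions of dimension", len(action_chunk[0]))
```

The sketch only conveys the data flow (perception and language in, a short sequence of actions out after several refinement steps); it says nothing about how the two modules are trained jointly.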

| Search Query: ArXiv Query: search_query=au:"Jing Wang"&id_list=&start=0&max_results=3
