[Paper] Learning Sim-to-Real Humanoid Locomotion in 15 Minutes

Published: December 1, 2025 at 01:55 PM EST
4 min read

Source: arXiv - 2512.01996v1

Overview

The paper presents a surprisingly fast pipeline for teaching humanoid robots to walk, run, and even imitate human motion—training a full‑body controller in just 15 minutes on a single RTX 4090 GPU. By leveraging off‑policy reinforcement learning (RL) algorithms that scale to thousands of parallel simulations, the authors show that high‑dimensional humanoid locomotion can move from “days of compute” to “minutes of training” while still transferring robustly to real robots.

Key Contributions

  • FastSAC & FastTD3 recipes: Simple off‑policy RL variants that remain stable when trained with massive parallelism (thousands of environments).
  • 15‑minute end‑to‑end training: Demonstrated on two commercial humanoids (Unitree G1 and Booster T1) using a single high‑end GPU.
  • Strong domain randomization: Randomized dynamics, uneven terrain, and external pushes are baked into training, yielding policies that survive real‑world disturbances.
  • Whole‑body motion tracking: The same pipeline can learn policies that follow human motion capture data, opening doors to expressive robot behaviors.
  • Open‑source release: Code, pretrained models, and video demos are publicly available, encouraging reproducibility and community extensions.

Methodology

  1. Massively Parallel Simulation – The authors spin up thousands of lightweight physics simulations (MuJoCo) on the GPU, feeding each environment with its own randomized parameters (mass, friction, terrain height, etc.).
  2. Off‑policy RL Core – They adapt Soft Actor‑Critic (SAC) and Twin‑Delayed DDPG (TD3) with a few stability tricks:
    • Minimalist reward shaping (mostly penalizing falls and encouraging forward velocity).
    • Gradient clipping and target‑network smoothing tuned for high‑throughput updates.
    • Experience replay buffers shared across all parallel environments, ensuring data efficiency (a simplified buffer-and-update sketch appears after this list).
  3. Domain Randomization Loop – At the start of each episode, the simulator samples a new set of dynamics and terrain parameters, forcing the policy to learn a robust, generalizable control law (a minimal sampling sketch follows this list).
  4. Policy Deployment – After training, the learned neural network (≈ 2 M parameters) runs on the robot’s onboard computer, receiving proprioceptive observations and outputting joint torques at 100 Hz.
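
Steps 1 and 3 amount to re-drawing a fresh vector of physics parameters for every environment at each reset. The snippet below is a minimal, illustrative sketch rather than the authors' released code: the parameter names, the ranges, and the `reset_all` hook are assumptions chosen for readability.

```python
import numpy as np

# Illustrative randomization ranges (assumed, not the paper's exact values).
RANDOMIZATION = {
    "mass_scale":     (0.8, 1.2),     # multiplicative scaling of link masses
    "friction":       (0.4, 1.2),     # ground friction coefficient
    "terrain_height": (-0.05, 0.05),  # per-tile height offset in metres
    "push_force":     (0.0, 15.0),    # magnitude of random external pushes in N
}

def sample_params(rng: np.random.Generator, num_envs: int) -> dict:
    """Draw one independent parameter set per parallel environment."""
    return {
        name: rng.uniform(low, high, size=num_envs)
        for name, (low, high) in RANDOMIZATION.items()
    }

def reset_all(rng: np.random.Generator, num_envs: int = 4096) -> dict:
    """Episode-start hook: re-randomize dynamics before the next rollout."""
    params = sample_params(rng, num_envs)
    # In the real pipeline these values would be written into the batched
    # simulator state (per-env masses, friction and heightfield buffers).
    return params

params = reset_all(np.random.default_rng(0))
print({name: values[:3] for name, values in params.items()})  # peek at the first three envs
```

Because every environment sees a slightly different world, the single shared policy is pushed toward control strategies that do not depend on any one parameter setting.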

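Step 2's data path can be pictured as one replay buffer that absorbs batched transitions from all parallel environments and serves minibatches to a TD3-style critic update. The sketch below is a simplified, generic illustration under assumed interfaces (a twin-Q critic, an `actor_target` module, and illustrative hyperparameters); it is not the FastSAC/FastTD3 code released by the authors.

```python
import torch
import torch.nn as nn


class TwinQ(nn.Module):
    """Two independent Q-networks, as in TD3 (clipped double-Q learning)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        def q_net():
            return nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
        self.q1, self.q2 = q_net(), q_net()

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.q1(x), self.q2(x)


class SharedReplayBuffer:
    """One buffer for all parallel environments; each add() stores a whole batch."""

    def __init__(self, capacity, num_envs, obs_dim, act_dim, device="cpu"):
        self.capacity, self.idx, self.full = capacity, 0, False
        z = lambda d: torch.zeros(capacity, num_envs, d, device=device)
        self.obs, self.act, self.rew = z(obs_dim), z(act_dim), z(1)
        self.next_obs, self.done = z(obs_dim), z(1)

    def add(self, obs, act, rew, next_obs, done):
        self.obs[self.idx], self.act[self.idx], self.rew[self.idx] = obs, act, rew
        self.next_obs[self.idx], self.done[self.idx] = next_obs, done
        self.idx = (self.idx + 1) % self.capacity
        self.full = self.full or self.idx == 0

    def sample(self, batch_size):
        hi = self.capacity if self.full else self.idx
        t = torch.randint(0, hi, (batch_size,), device=self.obs.device)
        e = torch.randint(0, self.obs.shape[1], (batch_size,), device=self.obs.device)
        return (self.obs[t, e], self.act[t, e], self.rew[t, e],
                self.next_obs[t, e], self.done[t, e])


def critic_update(critic, critic_target, actor_target, buffer, optimizer,
                  batch_size=8192, gamma=0.99):
    """One TD3-style critic step on a minibatch drawn across time steps and envs."""
    obs, act, rew, next_obs, done = buffer.sample(batch_size)
    with torch.no_grad():
        noise = (0.2 * torch.randn_like(act)).clamp(-0.5, 0.5)       # target-policy smoothing
        next_act = (actor_target(next_obs) + noise).clamp(-1.0, 1.0)
        q1_t, q2_t = critic_target(next_obs, next_act)
        target = rew + gamma * (1.0 - done) * torch.min(q1_t, q2_t)
    q1, q2 = critic(obs, act)
    loss = nn.functional.mse_loss(q1, target) + nn.functional.mse_loss(q2, target)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    nn.utils.clip_grad_norm_(critic.parameters(), 1.0)               # gradient clipping
    optimizer.step()
    return loss.item()
```

Sharing one buffer is what ties the massive parallelism to data efficiency: every simulator step contributes thousands of transitions that can be reused across many gradient updates.
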
Results & Findings

| Robot / Task | Training Time | Success Rate (real‑world) | Disturbances Handled |
|---|---|---|---|
| Unitree G1 | 15 min | 94 % (no falls over a 30 min test) | Random pushes up to 15 N, uneven terrain ±5 cm |
| Booster T1 | 15 min | 91 % | Same as above, plus slopes up to 10° |
| Motion tracking (human clips) | 15 min | Accurate pose following (average joint error < 5°) | Robust to sensor noise |

Key Takeaways

  • Training speed is orders of magnitude faster than prior works that required days on multi‑GPU clusters.
  • Robustness emerges directly from the heavy randomization; policies rarely need post‑training fine‑tuning.
  • Simplicity wins – the minimalist reward design avoids the brittle hand‑crafted shaping that often hampers transfer.

Practical Implications

  • Rapid Prototyping – Developers can iterate on locomotion behaviors in minutes, dramatically shortening the hardware‑in‑the‑loop development cycle.
  • Cost‑Effective Scaling – A single consumer‑grade GPU suffices, making large‑scale RL research accessible to startups and university labs.
  • Plug‑and‑Play Controllers – The released policies can be dropped into existing robot stacks (ROS2, Unitree SDK) with minimal integration effort; a hedged ROS2 wrapper sketch follows this list.
  • Adaptive Robots – Because the policy already expects a wide range of dynamics, it can be re‑used across robot variants or after hardware wear‑and‑tear without retraining.
  • Human‑Robot Interaction – Whole‑body motion‑tracking opens possibilities for robots that mimic human gestures or perform expressive tasks (e.g., assistive caregiving, entertainment).
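
To make the plug‑and‑play point concrete, the sketch below wraps a trained policy as a minimal ROS2 node running at 100 Hz. The topic names, the TorchScript file `policy.pt`, and the use of raw joint states as the observation are illustrative assumptions; the released controllers and the robots' actual interfaces (e.g. the Unitree SDK) will differ, and a real observation would also include IMU readings and velocity commands.

```python
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState
from std_msgs.msg import Float64MultiArray
import torch


class LocomotionPolicyNode(Node):
    """Minimal wrapper: proprioceptive observations in, joint commands out at 100 Hz."""

    def __init__(self):
        super().__init__("locomotion_policy")
        # Hypothetical TorchScript export of the trained controller.
        self.policy = torch.jit.load("policy.pt").eval()
        self.latest_obs = None
        self.create_subscription(JointState, "/joint_states", self.on_joint_state, 10)
        self.cmd_pub = self.create_publisher(Float64MultiArray, "/joint_effort_cmd", 10)
        self.create_timer(0.01, self.step)  # 100 Hz control loop

    def on_joint_state(self, msg: JointState):
        # A real deployment would concatenate IMU and command inputs here as well.
        self.latest_obs = torch.tensor(
            list(msg.position) + list(msg.velocity), dtype=torch.float32
        )

    def step(self):
        if self.latest_obs is None:
            return
        with torch.no_grad():
            action = self.policy(self.latest_obs.unsqueeze(0)).squeeze(0)
        msg = Float64MultiArray()
        msg.data = action.tolist()
        self.cmd_pub.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(LocomotionPolicyNode())
    rclpy.shutdown()


if __name__ == "__main__":
    main()
```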

Limitations & Future Work

  • Hardware Constraints – While the training runs on a single GPU, the inference still assumes a capable onboard processor; very low‑power platforms may need model compression.
  • Simulation Fidelity – The approach relies on MuJoCo’s fast but approximate physics; transferring to robots with highly compliant hardware may expose gaps.
  • Task Diversity – Experiments focus on locomotion and motion tracking; extending to manipulation or multi‑modal tasks remains an open challenge.
  • Safety Guarantees – The policies are robust but not formally verified; future work could integrate safety‑layer controllers or learning‑based verification.

Overall, the paper demonstrates that with the right algorithmic tweaks and massive parallel simulation, the “sim‑to‑real gap” for high‑dimensional humanoid control can be bridged in minutes rather than months—an exciting step toward truly agile, adaptable robots in the field.

Authors

  • Younggyo Seo
  • Carmelo Sferrazza
  • Juyue Chen
  • Guanya Shi
  • Rocky Duan
  • Pieter Abbeel

Paper Information

  • arXiv ID: 2512.01996v1
  • Categories: cs.RO, cs.AI, cs.LG
  • Published: December 1, 2025