[Paper] Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Source: arXiv - 2602.16705v1
Overview
The paper introduces HERO, a new framework that lets humanoid robots pick up and move arbitrary objects in everyday settings using only visual cues. By marrying large‑scale vision models (think CLIP‑style “open‑vocabulary” perception) with a highly accurate, learning‑augmented end‑effector (EE) controller trained in simulation, the authors achieve reliable loco‑manipulation across diverse real‑world environments—from office desks to coffee‑shop tables.
Key Contributions
- Residual‑aware EE tracking policy that blends classic inverse‑kinematics (IK) with a learned forward‑kinematics (FK) model, achieving a 3.2× reduction in tracking error.
- Modular integration of open‑vocabulary vision models (e.g., CLIP, ALIGN) for zero‑shot object recognition and pose estimation, enabling manipulation of any object describable in natural language.
- Simulation‑first training pipeline that produces a control policy transferable to real hardware without extensive real‑world data collection.
- Comprehensive evaluation on a full‑size humanoid robot across surfaces of varying heights (43 cm–92 cm) and in multiple indoor scenes, demonstrating robust pick‑and‑place of mugs, apples, toys, etc.
- Open‑source release of the EE tracker, vision adapters, and simulation environments to accelerate community research.
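To make the open‑vocabulary idea concrete, here is a minimal sketch of language‑to‑object grounding. It is illustrative only: the `embed` function below is a deterministic stand‑in for a real CLIP‑style encoder (which would embed text prompts and image crops), and `ground_command` and the detection dictionary are hypothetical names, not the authors' code.

```python
import hashlib
import math
import random

def embed(text, dim=8):
    """Stand-in for a CLIP-style encoder: a deterministic unit vector per
    string. A real system would embed text prompts and RGB-D image crops
    with a pretrained vision-language model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ground_command(command, detections):
    """Return the detected object label whose embedding best matches the
    natural-language command; `detections` maps label -> 3D centroid
    (as produced by an RGB-D detection front-end)."""
    query = embed(command)

    def score(label):
        # Cosine similarity (both vectors are unit-norm).
        return sum(q * e for q, e in zip(query, embed(label)))

    return max(detections, key=score)
```

Because scoring happens at query time against whatever labels the detector emits, no per‑object training is needed, which is the essence of the zero‑shot claim.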
Methodology
- Vision Front‑End – A large pre‑trained vision‑language model processes RGB‑D frames to produce an open‑vocabulary description of target objects and their 3D centroids. No task‑specific fine‑tuning is required.
- Residual‑Aware EE Tracker
  - Goal Generation: The vision module outputs a desired EE pose (position + orientation).
  - Inverse Kinematics (IK) Residual: Classical IK computes a reference joint trajectory that would reach the goal if the robot’s kinematics were perfect.
  - Neural Forward Model: A lightweight neural network predicts the actual EE pose resulting from the reference trajectory, capturing model errors, compliance, and sensor noise.
  - Goal Adjustment & Replanning: The predicted pose is compared to the target; the residual is fed back to adjust the reference trajectory, and the process repeats at 20 Hz.
- Control Stack – The refined joint commands are sent to a low‑level PD controller on the robot. The whole pipeline runs in real time on a single GPU‑enabled workstation.
- Training Regime – The neural FK model and the residual policy are trained entirely in a high‑fidelity physics simulator (MuJoCo) using domain randomization (mass, friction, sensor noise) to bridge the sim‑to‑real gap.
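The residual‑aware loop above can be sketched in a few lines. This is a toy 3‑D model under stated assumptions: `ik` is an idealized identity kinematics, `learned_fk` stands in for the neural forward model, and the fixed `BIAS` is a hypothetical systematic kinematic error; the paper's actual components are a full‑arm IK solver and a network trained in simulation.

```python
# Hypothetical systematic kinematic error the learned FK model has captured
# (calibration offset, compliance). Purely illustrative values, in meters.
BIAS = (0.03, -0.02, 0.01)

def ik(target):
    """Nominal inverse kinematics: the reference that would reach `target`
    if the robot's kinematic model were perfect (identity, for simplicity)."""
    return list(target)

def learned_fk(reference):
    """Learned forward model: predicts the EE pose the hardware will
    actually reach when commanded with `reference`."""
    return [r + b for r, b in zip(reference, BIAS)]

def residual_track(goal, iters=5):
    """Residual-aware goal adjustment: shift the commanded reference until
    the *predicted* achieved pose matches the desired goal. The paper runs
    this replanning loop at ~20 Hz."""
    command = list(goal)
    for _ in range(iters):
        predicted = learned_fk(ik(command))
        # Feed the residual (goal - predicted) back into the command.
        command = [c + (g - p) for c, g, p in zip(command, goal, predicted)]
    return command

goal = (0.5, 0.2, 0.9)
achieved = learned_fk(ik(residual_track(goal)))
```

With a fixed bias the loop converges in one step; the point of learning the forward model is that the same correction works for state‑dependent errors a static calibration cannot capture.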
Results & Findings
| Metric | Simulation | Real‑World |
|---|---|---|
| End‑effector tracking error (cm) | 1.2 | 1.5 |
| Success rate for pick‑and‑place (varied objects) | 94 % | 88 % |
| Reduction vs. baseline IK‑only | 3.2× lower error | 2.9× lower error |
| Generalization to unseen object categories (zero‑shot) | 91 % | 84 % |
Key Takeaways
- The residual‑aware tracker consistently outperforms pure IK or pure learning baselines, especially on taller surfaces where small kinematic errors compound.
- Open‑vocabulary perception enables the robot to follow natural‑language commands (“grab the red mug”) without any per‑object training.
- The sim‑trained policy transfers with minimal degradation, confirming the effectiveness of the domain randomization strategy.
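The domain‑randomization strategy credited for this transfer can be sketched as sampling a fresh set of simulator parameters per training episode. The parameter names and ranges below are illustrative assumptions, not the paper's actual values.

```python
import random

def sample_domain(rng=random):
    """Draw one episode's randomized simulator parameters. Ranges are
    hypothetical; the paper randomizes mass, friction, and sensor noise
    in MuJoCo to bridge the sim-to-real gap."""
    return {
        "link_mass_scale": rng.uniform(0.8, 1.2),    # +/-20% mass perturbation
        "friction": rng.uniform(0.5, 1.5),           # contact friction coeff.
        "sensor_noise_std": rng.uniform(0.0, 0.02),  # meters, on pose readings
    }

# Each training episode would reset the simulator with a new draw, so the
# policy never overfits to one (inevitably wrong) physics configuration.
params = sample_domain()
```

Training across this distribution forces the residual policy to be robust to the very modeling errors it will meet on real hardware.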
Practical Implications
- Rapid Prototyping of Service Robots – Developers can now equip a humanoid platform with a plug‑and‑play perception module and a pre‑trained EE tracker, bypassing costly data collection campaigns.
- Scalable Deployment – Because the vision component is zero‑shot, the same system can be rolled out across facilities (offices, hospitals, retail) and still recognize locally specific objects.
- Modular Architecture – HERO’s clear separation (vision ↔ residual tracker ↔ low‑level controller) fits existing robotics stacks (ROS2, Isaac SDK), making integration straightforward.
- Safety & Reliability – The closed‑loop residual correction reduces overshoot and collision risk, a critical factor for humanoids operating near humans.
- Foundation for Higher‑Level Tasks – Accurate EE control is a prerequisite for whole‑body locomotion, tool use, and collaborative manipulation, opening avenues for more complex autonomous behaviors.
Limitations & Future Work
- Hardware Dependency – The current implementation assumes a high‑precision joint encoder suite and a reliable depth sensor; performance may degrade on cheaper platforms.
- Dynamic Objects – HERO focuses on static objects; handling moving targets (e.g., handing a cup to a person) remains an open challenge.
- Computation Load – Real‑time residual planning runs at ~20 Hz on a GPU; embedded deployments may need model pruning or edge‑accelerators.
- Generalization to Outdoor/Unstructured Terrain – The system has only been validated on indoor, relatively flat surfaces; extending to uneven ground will require integrating whole‑body balance controllers.
Future directions outlined by the authors include incorporating tactile feedback into the residual loop, scaling the vision front‑end to multimodal language commands, and evaluating the approach on larger fleets of heterogeneous humanoid robots.
Authors
- Runpei Dong
- Ziyan Li
- Xialin He
- Saurabh Gupta
Paper Information
- arXiv ID: 2602.16705v1
- Categories: cs.RO, cs.CV
- Published: February 18, 2026