[Paper] Learning Humanoid End-Effector Control for Open-Vocabulary Visual Loco-Manipulation
Source: arXiv - 2602.16705v1
Overview
The paper introduces HERO, a new framework that lets humanoid robots pick up and move arbitrary objects in everyday settings using only visual cues. By marrying large‑scale vision models (think CLIP‑style “open‑vocabulary” perception) with a highly accurate, learning‑augmented end‑effector (EE) controller trained in simulation, the authors achieve reliable loco‑manipulation across diverse real‑world environments—from office desks to coffee‑shop tables.
Key Contributions
- Residual‑aware EE tracking policy that blends classic inverse‑kinematics (IK) with a learned forward‑kinematics (FK) model, achieving a 3.2× reduction in tracking error.
- Modular integration of open‑vocabulary vision models (e.g., CLIP, ALIGN) for zero‑shot object recognition and pose estimation, enabling manipulation of any object describable in natural language.
- Simulation‑first training pipeline that produces a control policy transferable to real hardware without extensive real‑world data collection.
- Comprehensive evaluation on a full‑size humanoid robot across surfaces of varying heights (43 cm–92 cm) and in multiple indoor scenes, demonstrating robust pick‑and‑place of mugs, apples, toys, etc.
- Open‑source release of the EE tracker, vision adapters, and simulation environments to accelerate community research.
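To make the open‑vocabulary idea concrete, here is a minimal sketch of language‑to‑object grounding. It is illustrative only: the `embed` function below is a deterministic stand‑in for a real CLIP‑style encoder (which would embed text prompts and image crops), and `ground_command` and the detection dictionary are hypothetical names, not the authors' code.

```python
import hashlib
import math
import random

def embed(text, dim=8):
    """Stand-in for a CLIP-style encoder: a deterministic unit vector per
    string. A real system would embed text prompts and RGB-D image crops
    with a pretrained vision-language model."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ground_command(command, detections):
    """Return the detected object label whose embedding best matches the
    natural-language command; `detections` maps label -> 3D centroid
    (as produced by an RGB-D detection front-end)."""
    query = embed(command)

    def score(label):
        # Cosine similarity (both vectors are unit-norm).
        return sum(q * e for q, e in zip(query, embed(label)))

    return max(detections, key=score)
```

Because scoring happens at query time against whatever labels the detector emits, no per‑object training is needed, which is the essence of the zero‑shot claim.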
Methodology
- Vision Front‑End – A large pre‑trained vision‑language model processes RGB‑D frames to produce an open‑vocabulary description of target objects and their 3D centroids. No task‑specific fine‑tuning is required.
- Residual‑Aware EE Tracker
  - Goal Generation: The vision module outputs a desired EE pose (position + orientation).
  - Inverse Kinematics (IK) Residual: Classical IK computes a reference joint trajectory that would reach the goal if the robot’s kinematics were perfect.
  - Neural Forward Model: A lightweight neural network predicts the actual EE pose resulting from the reference trajectory, capturing model errors, compliance, and sensor noise.
  - Goal Adjustment & Replanning: The predicted pose is compared to the target; the residual is fed back to adjust the reference trajectory, and the process repeats at 20 Hz.
- Control Stack – The refined joint commands are sent to a low‑level PD controller on the robot. The whole pipeline runs in real time on a single GPU‑enabled workstation.
- Training Regime – The neural FK model and the residual policy are trained entirely in a high‑fidelity physics simulator (MuJoCo) using domain randomization (mass, friction, sensor noise) to bridge the sim‑to‑real gap.
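The residual‑aware loop above can be sketched in a few lines. This is a toy 3‑D model under stated assumptions: `ik` is an idealized identity kinematics, `learned_fk` stands in for the neural forward model, and the fixed `BIAS` is a hypothetical systematic kinematic error; the paper's actual components are a full‑arm IK solver and a network trained in simulation.

```python
# Hypothetical systematic kinematic error the learned FK model has captured
# (calibration offset, compliance). Purely illustrative values, in meters.
BIAS = (0.03, -0.02, 0.01)

def ik(target):
    """Nominal inverse kinematics: the reference that would reach `target`
    if the robot's kinematic model were perfect (identity, for simplicity)."""
    return list(target)

def learned_fk(reference):
    """Learned forward model: predicts the EE pose the hardware will
    actually reach when commanded with `reference`."""
    return [r + b for r, b in zip(reference, BIAS)]

def residual_track(goal, iters=5):
    """Residual-aware goal adjustment: shift the commanded reference until
    the *predicted* achieved pose matches the desired goal. The paper runs
    this replanning loop at ~20 Hz."""
    command = list(goal)
    for _ in range(iters):
        predicted = learned_fk(ik(command))
        # Feed the residual (goal - predicted) back into the command.
        command = [c + (g - p) for c, g, p in zip(command, goal, predicted)]
    return command

goal = (0.5, 0.2, 0.9)
achieved = learned_fk(ik(residual_track(goal)))
```

With a fixed bias the loop converges in one step; the point of learning the forward model is that the same correction works for state‑dependent errors a static calibration cannot capture.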
Results & Findings
| Metric | Simulation | Real‑World |
|---|---|---|
| End‑effector tracking error (cm) | 1.2 | 1.5 |
| Success rate for pick‑and‑place (varied objects) | 94 % | 88 % |
| Reduction vs. baseline IK‑only | 3.2× lower error | 2.9× lower error |
| Generalization to unseen object categories (zero‑shot) | 91 % | 84 % |
Key Takeaways
- The residual‑aware tracker consistently outperforms pure IK or pure learning baselines, especially on taller surfaces where small kinematic errors compound.
- Open‑vocabulary perception enables the robot to follow natural‑language commands (“grab the red mug”) without any per‑object training.
- The sim‑trained policy transfers with minimal degradation, confirming the effectiveness of the domain randomization strategy.
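The domain‑randomization strategy credited for this transfer can be sketched as sampling a fresh set of simulator parameters per training episode. The parameter names and ranges below are illustrative assumptions, not the paper's actual values.

```python
import random

def sample_domain(rng=random):
    """Draw one episode's randomized simulator parameters. Ranges are
    hypothetical; the paper randomizes mass, friction, and sensor noise
    in MuJoCo to bridge the sim-to-real gap."""
    return {
        "link_mass_scale": rng.uniform(0.8, 1.2),    # +/-20% mass perturbation
        "friction": rng.uniform(0.5, 1.5),           # contact friction coeff.
        "sensor_noise_std": rng.uniform(0.0, 0.02),  # meters, on pose readings
    }

# Each training episode would reset the simulator with a new draw, so the
# policy never overfits to one (inevitably wrong) physics configuration.
params = sample_domain()
```

Training across this distribution forces the residual policy to be robust to the very modeling errors it will meet on real hardware.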
Practical Implications
- Rapid Prototyping of Service Robots – Developers can now equip a humanoid platform with a plug‑and‑play perception module and a pre‑trained EE tracker, bypassing costly data collection campaigns.
- Scalable Deployment – Because the vision component is zero‑shot, the same system can be rolled out across facilities (offices, hospitals, retail) and still recognize locally specific objects.
- Modular Architecture – HERO’s clear separation (vision ↔ residual tracker ↔ low‑level controller) fits existing robotics stacks (ROS2, Isaac SDK), making integration straightforward.
- Safety & Reliability – The closed‑loop residual correction reduces overshoot and collision risk, a critical factor for humanoids operating near humans.
- Foundation for Higher‑Level Tasks – Accurate EE control is a prerequisite for whole‑body locomotion, tool use, and collaborative manipulation, opening avenues for more complex autonomous behaviors.
Limitations & Future Work
- Hardware Dependency – The current implementation assumes a high‑precision joint encoder suite and a reliable depth sensor; performance may degrade on cheaper platforms.
- Dynamic Objects – HERO focuses on static objects; handling moving targets (e.g., handing a cup to a person) remains an open challenge.
- Computation Load – Real‑time residual planning runs at ~20 Hz on a GPU; embedded deployments may need model pruning or edge‑accelerators.
- Generalization to Outdoor/Unstructured Terrain – The system has only been validated on indoor, relatively flat surfaces; extending to uneven ground will require integrating whole‑body balance controllers.
Future directions outlined by the authors include incorporating tactile feedback into the residual loop, scaling the vision front‑end to multimodal language commands, and evaluating the approach on larger fleets of heterogeneous humanoid robots.
Authors
- Runpei Dong
- Ziyan Li
- Xialin He
- Saurabh Gupta
Paper Information
- arXiv ID: 2602.16705v1
- Categories: cs.RO, cs.CV
- Published: February 18, 2026