[Paper] Coordinated Humanoid Manipulation with Choice Policies
Source: arXiv - 2512.25072v1
Overview
The paper introduces a new system that lets humanoid robots perform complex whole‑body tasks—like loading a dishwasher or wiping a whiteboard—by combining an intuitive tele‑operation interface with a novel imitation‑learning algorithm called Choice Policy. By breaking down robot control into modular sub‑tasks and learning from high‑quality human demonstrations, the authors achieve reliable coordination across the robot’s head, hands, and legs in real‑world, unstructured environments.
Key Contributions
- Modular tele‑operation framework that decomposes humanoid control into hand‑eye coordination, grasp primitives, arm tracking, and locomotion, enabling fast and scalable data collection.
- Choice Policy: an imitation‑learning architecture that generates multiple candidate actions, scores them, and selects the best one, efficiently handling multimodal behaviors.
- Empirical validation on two challenging real‑world tasks (dishwasher loading and whole‑body loco‑manipulation for whiteboard wiping), demonstrating superior performance over diffusion‑based policies and vanilla behavior cloning.
- Insightful analysis of hand‑eye coordination, showing its pivotal role in long‑horizon manipulation tasks for humanoids.
- Open‑source‑ready pipeline that can be adapted to other humanoid platforms and task families with minimal engineering effort.
Methodology
- Tele‑operation data collection – The robot is controlled through a set of intuitive interfaces: a VR headset for head orientation, a 6‑DOF controller for each hand, and a foot‑pad for locomotion. Operators execute sub‑tasks (e.g., “grasp cup”, “step forward”) while the system logs synchronized sensor data and robot joint states.
- Modular decomposition – Each sub‑task is treated as a separate “skill” with its own observation/action space, making it easier to capture clean demonstrations and to reuse skills across tasks.
- Choice Policy architecture (a minimal code sketch follows this list)
- Candidate generator: a lightweight neural network predicts a small set (e.g., 5‑10) of plausible next actions given the current observation.
- Scoring network: a second network evaluates each candidate using a learned value function that reflects how well the action aligns with the demonstrated behavior.
- Selection: the highest‑scoring candidate is executed, allowing fast inference (≈ 10 ms) while preserving the ability to express multimodal options (e.g., different grasp approaches).
- Training – The system is trained via supervised imitation learning on the collected demonstrations, with an auxiliary loss that encourages diversity among generated candidates.
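The summary above describes the Choice Policy only at a high level, so the following is a minimal sketch of how such a generate–score–select policy could be wired up. The network sizes, the best-of-K imitation loss, the ranking loss for the scorer, and the pairwise diversity term are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal Choice Policy sketch, assuming a PyTorch-style setup.
# Architecture sizes and loss terms are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChoicePolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, num_candidates: int = 8):
        super().__init__()
        self.K = num_candidates
        self.act_dim = act_dim
        # Candidate generator: one forward pass emits K candidate actions.
        self.generator = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, self.K * act_dim),
        )
        # Scoring network: rates each (observation, candidate) pair.
        self.scorer = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def candidates(self, obs):                       # obs: (B, obs_dim)
        return self.generator(obs).view(-1, self.K, self.act_dim)

    def scores(self, obs, cands):                    # cands: (B, K, act_dim)
        obs_rep = obs.unsqueeze(1).expand(-1, self.K, -1)
        return self.scorer(torch.cat([obs_rep, cands], dim=-1)).squeeze(-1)

    @torch.no_grad()
    def act(self, obs):
        """Inference: generate K candidates, score them, execute the best."""
        cands = self.candidates(obs)
        best = self.scores(obs, cands).argmax(dim=-1)            # (B,)
        return cands[torch.arange(obs.shape[0]), best]

def training_loss(policy, obs, demo_action, diversity_weight=0.1):
    """Imitation on the closest candidate, plus scorer ranking and diversity."""
    cands = policy.candidates(obs)                               # (B, K, A)
    err = ((cands - demo_action.unsqueeze(1)) ** 2).mean(-1)     # (B, K)
    # Best-of-K imitation: only the closest candidate must match the demo,
    # which frees the remaining candidates to cover other behavior modes.
    imitation = err.min(dim=-1).values.mean()
    # Scorer is trained to rank the closest candidate highest.
    ranking = F.cross_entropy(policy.scores(obs, cands.detach()),
                              err.argmin(dim=-1))
    # Auxiliary diversity term: penalize candidates that collapse together.
    spread = torch.cdist(cands, cands).mean()
    return imitation + ranking - diversity_weight * spread
```

At deployment the policy calls `act` once per control step; because only a single generator pass and K scorer evaluations are needed, this style of inference is consistent with the ≈ 10 ms decision time reported in the paper, in contrast to iterative diffusion sampling.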
Results & Findings
Success rates on the two real‑world tasks:

| Task | Choice Policy | Diffusion Policy | Behavior Cloning |
|---|---|---|---|
| Dishwasher loading | 92 % | 78 % | 65 % |
| Whiteboard wiping (whole‑body) | 88 % | 71 % | 60 % |
- Higher success rates: Choice Policy consistently outperformed both diffusion‑based policies and standard behavior cloning across both tasks.
- Speed: Inference time per decision was ≈ 10 ms for Choice Policy vs. ≈ 120 ms for diffusion models, enabling smoother real‑time control.
- Ablation of hand‑eye coordination: Removing the dedicated hand‑eye module dropped success rates by ~20 % on the dishwasher task, confirming its critical role.
- Robustness to disturbances: The policy could recover from minor pushes or unexpected object placements without re‑initializing the entire trajectory.
Practical Implications
- Scalable data pipelines: The modular tele‑operation setup lowers the barrier for collecting large, high‑quality datasets on any humanoid platform, accelerating research and product development.
- Real‑time deployment: The fast inference of Choice Policy makes it viable for on‑board execution on current compute‑constrained humanoids, opening doors to service‑robot applications in homes, hospitals, and offices.
- Multimodal decision making: By explicitly generating and scoring several actions, developers can embed safety checks or preference heuristics (e.g., energy efficiency, collision avoidance) into the scoring stage, as sketched after this list.
- Transferability: Because skills are modular, a library of reusable primitives (grasp, step, turn head) can be assembled for new tasks, reducing the need for task‑specific retraining.
- Benchmark for whole‑body coordination: The paper’s experimental setup (dishwasher, whiteboard) provides a concrete benchmark that industry teams can adopt to evaluate their own humanoid controllers.
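To make the multimodal decision‑making point above concrete, the snippet below shows one way the learned candidate scores could be combined with hand‑written penalties at deployment time. The `collision_penalty` and `energy_cost` helpers and the weights are hypothetical stand‑ins for illustration, not part of the paper; the sketch reuses the `ChoicePolicy` interface assumed earlier.

```python
# Hypothetical deployment-time selection: learned score minus heuristic
# penalties. `collision_penalty` and `energy_cost` are illustrative stand-ins.
import torch

def select_action(policy, obs, collision_penalty, energy_cost,
                  w_collision=10.0, w_energy=0.1):
    cands = policy.candidates(obs)               # (B, K, act_dim)
    score = policy.scores(obs, cands)            # (B, K) learned preference
    # Subtract heuristic penalties so unsafe or wasteful candidates lose out
    # even if the learned scorer ranks them highly.
    score = score - w_collision * collision_penalty(obs, cands) \
                  - w_energy * energy_cost(cands)
    best = score.argmax(dim=-1)
    return cands[torch.arange(obs.shape[0]), best]
```

Because the candidates are explicit, this kind of post‑hoc filtering leaves the learned policy untouched, which is one practical advantage of the generate‑then‑score design over policies that emit a single action.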
Limitations & Future Work
- Demonstration dependence: The system still relies on a sizable set of high‑quality tele‑operated demos; scaling to extremely diverse tasks may require further automation in data collection.
- Limited perception: The current pipeline uses relatively simple visual inputs (RGB‑D) and does not incorporate advanced scene understanding (e.g., semantic segmentation), which could improve robustness in cluttered environments.
- Generalization across robot morphologies: Experiments were performed on a single humanoid platform; adapting the approach to robots with different kinematics may need additional calibration.
- Future directions suggested by the authors include integrating self‑supervised perception modules, exploring hierarchical Choice Policies for longer‑horizon planning, and extending the framework to collaborative multi‑robot scenarios.
Authors
- Haozhi Qi
- Yen-Jen Wang
- Toru Lin
- Brent Yi
- Yi Ma
- Koushil Sreenath
- Jitendra Malik
Paper Information
- arXiv ID: 2512.25072v1
- Categories: cs.RO, cs.AI, cs.LG
- Published: December 31, 2025