[Paper] Generate, Transfer, Adapt: Learning Functional Dexterous Grasping from a Single Human Demonstration
Source: arXiv - 2601.05243v1
Overview
A single human demonstration can now teach a dexterous robot hand to grasp a whole family of objects for functional use. The paper introduces CorDex, a pipeline that synthesizes rich training data from that one demo, learns a multimodal grasp predictor, and delivers reliable functional grasps on unseen items—bridging the gap between scarce real‑world data and the need for semantic‑geometric reasoning in robot manipulation.
Key Contributions
- Correspondence‑based data engine: Generates diverse synthetic objects in the same category as the single demonstrated object, transfers the expert grasp to each via shape correspondence, and refines it with optimization.
- Multimodal prediction network: Fuses visual (RGB‑D) and geometric (point‑cloud) cues through a novel local‑global fusion module.
- Importance‑aware sampling: Prioritizes high‑impact contact regions during inference, cutting computation while preserving accuracy.
- One‑shot learning: Demonstrates that functional dexterous grasps for an entire object category can be learned from just one human demonstration.
- State‑of‑the‑art performance: Outperforms prior functional grasping methods on multiple benchmark categories in both simulation and real‑world robot experiments.
Methodology
Data Generation from a Single Demo
- The human provides a single functional grasp (e.g., holding a mug by its handle).
- A procedural generator creates many synthetic objects belonging to the same semantic category (different mug shapes, sizes, and textures).
- For each synthetic object, a correspondence estimator aligns its geometry to the demonstrated object, transferring the hand pose onto the new shape.
- An optimization step adjusts the finger joint angles to resolve interpenetrations and improve grasp stability, yielding a high‑quality labeled dataset (object mesh + functional grasp); a sketch of this transfer‑and‑refine step follows this list.
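A minimal sketch of the transfer‑and‑refine idea is shown below. Everything here is an illustrative assumption rather than the paper's code: `corr` is a precomputed dense correspondence map, `fingertip_fk` is a hand forward‑kinematics callable, and `sdf` is a signed‑distance function for the synthetic object.

```python
# Hedged sketch of correspondence-based grasp transfer plus refinement.
# All names and inputs are illustrative assumptions, not the paper's code.
import numpy as np
from scipy.optimize import minimize

def transfer_contacts(src_contacts, src_points, tgt_points, corr):
    # For each demo contact, find its nearest point on the source surface,
    # then read off the matched target point through the correspondence map
    # corr (corr[i] = index of the target point matched to source point i).
    idx = np.argmin(
        np.linalg.norm(src_points[None, :, :] - src_contacts[:, None, :], axis=-1),
        axis=1,
    )
    return tgt_points[corr[idx]]

def refine_joints(q0, fingertip_fk, tgt_contacts, sdf, w_pen=10.0):
    # Pull fingertips toward the transferred contacts while penalizing
    # penetration (negative signed distance) into the object surface.
    def cost(q):
        tips = fingertip_fk(q)                        # (F, 3) fingertip positions
        reach = np.sum((tips - tgt_contacts) ** 2)    # track transferred contacts
        pen = np.sum(np.minimum(sdf(tips), 0.0) ** 2) # punish interpenetration
        return reach + w_pen * pen

    return minimize(cost, q0, method="L-BFGS-B").x
```

A full data engine would likely add grasp‑quality terms to the objective; this sketch keeps only the two ingredients named above (contact tracking and a penetration penalty) for clarity.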
Learning the Grasp Predictor
- Input: RGB‑D image of the target object + a sampled point cloud of its surface.
- A local‑global fusion module extracts fine‑grained local features (e.g., handle curvature) and aggregates them with global context (object category, overall shape); see the illustrative sketch after this list.
- The network outputs a set of candidate hand poses; an importance‑aware sampler ranks them by predicted functional relevance, allowing the system to evaluate only the most promising candidates.
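As a rough illustration of the local‑global idea (not the paper's exact architecture), the PyTorch module below concatenates per‑point local features with a pooled geometric descriptor and an image embedding; all layer names and sizes are assumptions.

```python
# Illustrative local-global fusion: per-point "local" features are combined
# with a pooled "global" descriptor that also carries the RGB-D embedding.
import torch
import torch.nn as nn

class LocalGlobalFusion(nn.Module):
    def __init__(self, d_local=64, d_img=128, d_out=128):
        super().__init__()
        self.point_mlp = nn.Sequential(            # local geometry encoder
            nn.Linear(3, d_local), nn.ReLU(), nn.Linear(d_local, d_local)
        )
        self.fuse = nn.Sequential(                 # combine local + global
            nn.Linear(d_local + d_local + d_img, d_out), nn.ReLU(),
            nn.Linear(d_out, d_out)
        )

    def forward(self, points, img_feat):
        # points: (B, N, 3) sampled surface points; img_feat: (B, d_img)
        local = self.point_mlp(points)                     # (B, N, d_local)
        global_geo = local.max(dim=1).values               # pooled shape descriptor
        ctx = torch.cat([global_geo, img_feat], dim=-1)    # global context
        ctx = ctx.unsqueeze(1).expand(-1, points.size(1), -1)
        return self.fuse(torch.cat([local, ctx], dim=-1))  # (B, N, d_out)
```

Broadcasting the pooled context back to every point lets a per‑point grasp head see both the fine geometry at each location and the object as a whole, which is the intuition behind the fusion module described above.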
Inference on Novel Objects
- Given a new, unseen object, the model predicts a functional dexterous grasp in real time, which can be executed directly on a robotic hand (e.g., Shadow Hand, Allegro Hand) after a brief kinematic validation; a sketch of this ranked, top‑k selection appears below.
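A minimal sketch of that selection loop, assuming the network returns candidate poses with per‑candidate importance scores, and with `validate_kinematics` standing in for whatever feasibility check the robot stack provides:

```python
# Hedged sketch of importance-aware candidate selection: run the expensive
# kinematic validation only on the k highest-scoring predicted grasps.
import numpy as np

def select_grasp(poses, scores, validate_kinematics, k=8):
    # poses: (M, D) flattened wrist pose + joint angles; scores: (M,)
    order = np.argsort(scores)[::-1][:k]   # highest predicted relevance first
    for i in order:
        if validate_kinematics(poses[i]):  # reachability / collision check
            return poses[i]
    return None                            # no feasible grasp among the top-k
```

This validate‑only‑the‑top‑k pattern is how the sampler can cut computation while preserving accuracy, consistent with the efficiency numbers reported below.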
Results & Findings
- Generalization: Trained on ~2k synthetic grasps derived from a single demo, CorDex achieved a >85% success rate on 10 unseen object instances per category (mugs, scissors, hammers, etc.), compared with 60–70% for the strongest baselines.
- Efficiency: The importance‑aware sampler reduced inference time from ~120 ms (full candidate set) to ~35 ms with no loss in success rate, enabling near‑real‑time operation.
- Ablations: Both the correspondence‑based data engine and the local‑global fusion module contribute roughly 10–12% each to the overall performance gain.
- Real‑world validation: On a physical Shadow Hand mounted on a UR5 arm, the system performed functional tasks (e.g., pouring from a mug, cutting with scissors) successfully in >80% of trials, despite variations in lighting and object texture.
Practical Implications
- Rapid prototyping of robot skills: Developers can bootstrap functional grasping for a new tool class with just one human demonstration, dramatically cutting data collection time.
- Scalable tool‑use libraries: Manufacturing or warehouse robots can expand their repertoire of manipulable objects without exhaustive manual labeling; simply feed one human demo per object class and let CorDex synthesize the rest.
- Integration with existing pipelines: The multimodal predictor can be dropped into ROS‑based manipulation stacks, feeding grasp poses to motion planners that already handle collision checking and trajectory generation (see the sketch after this list).
- Cost‑effective simulation‑to‑real transfer: By leveraging synthetic data grounded in real human grasps, the approach reduces the reliance on expensive tele‑operation or motion‑capture setups for dataset creation.
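As a rough sketch of such an integration (the topic name, frame, and pose layout below are assumptions, not from the paper), a predicted wrist pose could be published as a standard geometry_msgs/PoseStamped for a downstream planner such as MoveIt to consume:

```python
# Minimal ROS 1 sketch: publish a predicted wrist pose so an existing
# motion-planning stack can plan to it. Topic, frame, and the
# [x, y, z, qx, qy, qz, qw] pose layout are illustrative assumptions.
import rospy
from geometry_msgs.msg import PoseStamped

def publish_grasp(pose_xyzquat, frame="base_link", topic="/cordex/grasp_pose"):
    pub = rospy.Publisher(topic, PoseStamped, queue_size=1, latch=True)
    msg = PoseStamped()
    msg.header.stamp = rospy.Time.now()
    msg.header.frame_id = frame
    (msg.pose.position.x, msg.pose.position.y, msg.pose.position.z,
     msg.pose.orientation.x, msg.pose.orientation.y,
     msg.pose.orientation.z, msg.pose.orientation.w) = pose_xyzquat
    pub.publish(msg)

if __name__ == "__main__":
    rospy.init_node("cordex_grasp_publisher")
    publish_grasp([0.4, 0.0, 0.25, 0.0, 0.0, 0.0, 1.0])  # example pose
    rospy.spin()
```

Latching the publisher keeps the last pose available to planners that subscribe after it is published.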
Limitations & Future Work
- Simulation fidelity: The synthetic objects are generated procedurally; highly irregular or deformable items (e.g., soft fabrics) remain challenging.
- Single‑demo bias: While effective for many categories, the method assumes the human demo captures the essential functional contact; ambiguous tasks may need multiple demonstrations.
- Hardware constraints: The current implementation targets high‑DOF anthropomorphic hands; adapting to simpler grippers may require redesigning the correspondence transfer step.
- Future directions suggested by the authors include extending the correspondence engine to handle deformable objects, incorporating tactile feedback for closed‑loop refinement, and scaling the framework to learn multi‑step manipulation sequences beyond a single grasp.
Authors
- Xingyi He
- Adhitya Polavaram
- Yunhao Cao
- Om Deshmukh
- Tianrui Wang
- Xiaowei Zhou
- Kuan Fang
Paper Information
- arXiv ID: 2601.05243v1
- Categories: cs.RO, cs.CV
- Published: January 8, 2026