[Paper] Dex4D: Task-Agnostic Point Track Policy for Sim-to-Real Dexterous Manipulation

Published: February 17, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2602.15828v1

Overview

Dex4D tackles one of the toughest problems in robotics: teaching a dexterous hand to handle any object in any pose without hand‑crafting task‑specific simulators or reward functions. By training a single “any‑pose‑to‑any‑pose” policy in simulation and then feeding it real‑world point‑track cues, the authors achieve zero‑shot transfer to a wide variety of manipulation tasks.

Key Contributions

  • Task‑agnostic 3‑D point‑track policy – a single neural controller that can move any object from an arbitrary start pose to a desired target pose, conditioned only on a stream of 3‑D points.
  • Massive, diverse simulation curriculum – training on thousands of procedurally generated objects and pose pairs to cover a broad interaction space.
  • Zero‑shot sim‑to‑real transfer – no fine‑tuning on real hardware; the policy works out‑of‑the‑box when supplied with point tracks extracted from a short video of the target motion.
  • Closed‑loop perception via online point tracking – the robot continuously updates its belief about the object’s pose, enabling robust execution under visual noise and occlusions.
  • Extensive empirical validation – experiments on a Shadow‑Hand‑like platform demonstrate consistent gains over prior sim‑to‑real baselines across dozens of tasks, objects, and scene variations.

Methodology

  1. Simulation data generation
    • Randomly sample 3‑D meshes (≈10k objects) and spawn them in a physics engine.
    • For each object, generate many start/goal pose pairs and compute a dense set of surface points (the “track”).
  2. Policy architecture
    • Input: a sequence of 3‑D point clouds representing the desired trajectory (the goal track) and the current observed point cloud (the current track).
    • Backbone: a point‑net‑style encoder that extracts a latent representation of the object’s geometry and motion intent.
    • Output: joint torques for the 24‑DoF dexterous hand, produced by a lightweight MLP decoder.
  3. Training objective
    • Reinforcement learning with a dense reward that penalizes deviation from the target point track and encourages smooth, stable hand motions.
    • Domain randomization (lighting, textures, friction) to bridge the sim‑to‑real gap.
  4. Real‑world deployment
    • Record a short video of the desired object motion (or synthesize it).
    • Run an off‑the‑shelf 3‑D point tracker (e.g., DeepLabCut‑3D, OpenPose‑3D) to extract the goal point track.
    • Feed the live point cloud from an RGB‑D sensor and the goal track to the policy, which then drives the hand in closed‑loop.
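The paper's code is not reproduced here, but the architecture and reward described in steps 2-3 can be sketched in plain numpy. Every layer size, the point-count, the 1-bit track tag, and the smoothness coefficient below are assumptions for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

class PointTrackPolicy:
    """Sketch of the described policy: a PointNet-style per-point MLP
    followed by order-invariant max-pooling, then a lightweight MLP
    decoder emitting torques for a 24-DoF hand. Sizes are illustrative."""

    def __init__(self, feat_dim=64, hidden=128, n_joints=24):
        # Per-point input is (x, y, z) plus a 1-bit tag distinguishing
        # the current track from the goal track -> 4 input dims.
        self.w_enc = rng.standard_normal((4, feat_dim)) * 0.1
        self.w_dec1 = rng.standard_normal((feat_dim, hidden)) * 0.1
        self.w_dec2 = rng.standard_normal((hidden, n_joints)) * 0.1

    def act(self, current_pts, goal_pts):
        # Tag each point with its track (0 = current, 1 = goal).
        cur = np.hstack([current_pts, np.zeros((len(current_pts), 1))])
        goal = np.hstack([goal_pts, np.ones((len(goal_pts), 1))])
        pts = np.vstack([cur, goal])            # (N, 4)
        feats = relu(pts @ self.w_enc)          # per-point features
        latent = feats.max(axis=0)              # pool over points
        torques = relu(latent @ self.w_dec1) @ self.w_dec2
        return torques                          # (24,)

def tracking_reward(observed_pts, goal_pts, torques, smooth_coef=1e-3):
    """Dense reward in the spirit of step 3: negative mean distance
    between observed and goal track points, minus a small torque
    penalty encouraging smooth motion. Coefficients are assumptions."""
    track_err = np.linalg.norm(observed_pts - goal_pts, axis=1).mean()
    return -track_err - smooth_coef * np.square(torques).sum()

policy = PointTrackPolicy()
current = rng.standard_normal((128, 3))
goal = rng.standard_normal((128, 3))
tau = policy.act(current, goal)
print(tau.shape)  # (24,)
```

The max-pool over points is what makes the encoder indifferent to point ordering and count, which matters when the live tracker drops or re-detects points.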

Results & Findings

| Setting | Success Rate (Goal Pose Reached) | Compared Baseline |
| --- | --- | --- |
| Simulated, 50 random objects | 87 % | 62 % (task-specific policies) |
| Real-world, 10 novel objects (no fine-tuning) | 71 % | 48 % (domain-randomized baselines) |
| Varying background / lighting | ±5 % drop from nominal | Baselines drop > 20 % |
| Trajectory length up to 1 m | Maintained > 65 % success | Baselines fail > 30 % |

Key takeaways: the single policy generalizes across object shapes, textures, and even unseen scene layouts. Online point tracking keeps the controller stable despite occlusions, and the zero‑shot pipeline eliminates the costly real‑world data collection loop.

Practical Implications

  • Rapid prototyping of manipulation tasks – developers can script a new pick‑and‑place or reorientation task simply by providing a short demonstration video, without writing custom reward functions or simulation environments.
  • Scalable data‑efficiency – training once in simulation replaces weeks of tele‑operated data collection, cutting R&D costs dramatically.
  • Plug‑and‑play hardware integration – the approach works with any dexterous hand that can be driven by torque commands and equipped with an RGB‑D sensor, making it suitable for research labs and emerging commercial kits.
  • Foundation for higher‑level planners – the “any‑pose‑to‑any‑pose” primitive can be composed by task planners (e.g., ROS MoveIt extensions) to solve multi‑step assembly or tool‑use scenarios.
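As a purely hypothetical illustration of that last point, a higher-level planner could chain the any-pose-to-any-pose primitive across waypoints. The `Pose` type, `execute_primitive` stub, and plan structure below are assumptions, not an API from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Pose:
    position: np.ndarray      # (3,) object position
    quaternion: np.ndarray    # (4,) object orientation

def execute_primitive(start: Pose, goal: Pose) -> bool:
    """Stand-in for the learned any-pose-to-any-pose policy. A real
    system would stream point tracks from the RGB-D sensor and drive
    the hand in closed loop; here we just report success to show the
    composition pattern."""
    return True

def run_plan(waypoints: list) -> bool:
    """Chain consecutive waypoint pairs through the primitive, the way
    a task planner could decompose a multi-step reorientation."""
    for start, goal in zip(waypoints, waypoints[1:]):
        if not execute_primitive(start, goal):
            return False   # replan or abort on a failed segment
    return True

identity = np.array([0.0, 0.0, 0.0, 1.0])
plan = [Pose(np.array([0.0, 0.0, 0.1]), identity),
        Pose(np.array([0.1, 0.0, 0.1]), identity),
        Pose(np.array([0.1, 0.1, 0.2]), identity)]
print(run_plan(plan))  # True
```

Because each segment is itself closed-loop, a planner only needs to supply the next goal pose when the previous segment reports success.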

Limitations & Future Work

  • Reliance on accurate 3‑D point tracking – failure modes arise when the tracker loses points due to severe occlusion or reflective surfaces.
  • Torque‑level control only – the method assumes low‑level torque controllers; adapting to position‑control APIs may need additional calibration.
  • Object rigidity assumption – deformable or articulated objects are out of scope.
  • Future directions suggested by the authors include integrating tactile feedback for finer grasp adjustments, extending the policy to handle multi‑object interactions, and exploring self‑supervised refinement on real hardware to push performance beyond zero‑shot levels.

Authors

  • Yuxuan Kuang
  • Sungjae Park
  • Katerina Fragkiadaki
  • Shubham Tulsiani

Paper Information

  • arXiv ID: 2602.15828v1
  • Categories: cs.RO, cs.CV, cs.LG
  • Published: February 17, 2026
  • PDF: Download PDF