[Paper] ImplicitRDP: An End-to-End Visual-Force Diffusion Policy with Structural Slow-Fast Learning
Source: arXiv - 2512.10946v1
Overview
The paper introduces ImplicitRDP, a single‑network diffusion policy that fuses visual perception and force feedback for contact‑rich robot manipulation. By treating vision as a “slow” global cue and force as a “fast” local cue, the authors devise a learning scheme that lets robots react at high force‑sensor frequencies while still planning coherent motion sequences—an advance that could make robot hands more reliable in real‑world assembly, insertion, and handling tasks.
Key Contributions
- Unified visual‑force diffusion policy that replaces the usual two‑stage (vision planner + force controller) pipeline.
- Structural Slow‑Fast Learning: a causal‑attention mechanism that processes asynchronous visual tokens (low‑rate) and force tokens (high‑rate) within the same transformer, preserving temporal coherence of action chunks while enabling rapid force‑level corrections.
- Virtual‑target Representation Regularization: an auxiliary loss that maps force feedback into the same latent space as the robot’s action, preventing the network from ignoring the force modality (modality collapse).
- End‑to‑end training on raw RGB‑D and force streams without handcrafted feature engineering or separate controllers.
- Empirical validation on several benchmark contact‑rich tasks (peg‑in‑hole, drawer opening, cable routing) showing higher success rates and lower latency compared with vision‑only and hierarchical baselines.
Methodology
1. Data Representation
- Visual tokens: a short history of RGB‑D frames sampled at a low rate (e.g., 5 Hz) is encoded with a pretrained CNN.
- Force tokens: force/torque readings sampled at the native sensor rate (≈100 Hz) are embedded with a lightweight MLP (a minimal tokenizer sketch follows this step).
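Below is a minimal tokenizer sketch under the assumptions above (ResNet‑18 backbone, 256‑dim tokens, RGB frames only; module names such as `VisualTokenizer` and `ForceTokenizer` are illustrative, not the paper's). A depth channel would require adapting the first convolution.

```python
import torch.nn as nn
import torchvision.models as models

class VisualTokenizer(nn.Module):
    """Encode a low-rate window of RGB frames into 'slow' visual tokens."""
    def __init__(self, d_model=256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # load pretrained weights in practice
        backbone.fc = nn.Identity()                # expose 512-dim per-frame features
        self.backbone = backbone
        self.proj = nn.Linear(512, d_model)

    def forward(self, frames):                     # frames: (B, T_v, 3, H, W), T_v at ~5 Hz
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w))
        return self.proj(feats).reshape(b, t, -1)  # (B, T_v, d_model)

class ForceTokenizer(nn.Module):
    """Embed high-rate 6-DoF force/torque readings into 'fast' force tokens."""
    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, 128), nn.ReLU(), nn.Linear(128, d_model))

    def forward(self, wrench):                     # wrench: (B, T_f, 6), T_f at ~100 Hz
        return self.mlp(wrench)                    # (B, T_f, d_model)
```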
2. Slow‑Fast Transformer
- The slow visual tokens and fast force tokens are merged into a single sequence and processed by one transformer.
- Causal attention lets each force token attend to all past visual tokens but never to future ones, preserving the "slow" context while allowing "fast" reactive updates (see the masking sketch below).
- The transformer outputs a diffusion latent that is subsequently denoised into a sequence of robot joint actions (an action chunk).
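One way to realize this causal attention is a time-based mask over the merged slow/fast sequence: a token may attend only to tokens whose timestamp is not later than its own. The sketch below is an assumption about the masking scheme, not the paper's exact implementation; rates and names are illustrative.

```python
import torch

def time_causal_mask(vis_times, force_times):
    """Boolean mask over the merged token sequence: True where attention is allowed,
    i.e. the key's timestamp is <= the query's timestamp."""
    times = torch.cat([vis_times, force_times])        # (T_v + T_f,)
    return times.unsqueeze(1) >= times.unsqueeze(0)    # (T, T)

# Example: 2 s of history, visual tokens at 5 Hz, force tokens at 100 Hz.
vis_t   = torch.arange(0, 2.0, 1 / 5)    # 10 visual timestamps
force_t = torch.arange(0, 2.0, 1 / 100)  # 200 force timestamps
allowed = time_causal_mask(vis_t, force_t)

# With nn.TransformerEncoder, True in the attention mask means "do not attend",
# so the mask is inverted when passed in: encoder(tokens, mask=~allowed)
```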
3. Diffusion Policy
- A standard denoising diffusion probabilistic model (DDPM) generates smooth action trajectories from noisy latent samples.
- The diffusion process is conditioned on the combined visual‑force representation, enabling the policy to sample actions that respect both global geometry and instantaneous contact forces (a minimal sampling sketch follows this step).
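For concreteness, a bare-bones conditional DDPM sampling loop for an action chunk is sketched below. The noise schedule, chunk length, and action dimension are placeholders, and `denoiser` stands in for the conditioned network; none of these names or values are taken from the paper.

```python
import torch

@torch.no_grad()
def sample_action_chunk(denoiser, cond, chunk_len=16, action_dim=7, steps=100):
    """Standard DDPM ancestral sampling; denoiser(x_t, t, cond) predicts the added noise."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, chunk_len, action_dim)          # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, torch.tensor([t]), cond)     # predicted noise at step t
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        x = mean + torch.sqrt(betas[t]) * torch.randn_like(x) if t > 0 else mean
    return x                                           # (1, chunk_len, action_dim)
```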
4. Virtual‑target Regularization
- An auxiliary head predicts a "virtual target" vector from the force embedding; this vector is aligned (via an L2 loss) with the action embedding produced by the diffusion decoder.
- The regularizer supplies a physics‑grounded gradient that pushes the policy to actually use force information rather than ignore it (see the sketch below).
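A minimal sketch of such a regularizer is shown below, assuming a pooled force embedding and a small MLP head; the names, dimensions, and the stop-gradient on the action side are illustrative choices, not details taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F

class VirtualTargetHead(nn.Module):
    """Map the pooled force embedding to a 'virtual target' in the action latent space."""
    def __init__(self, d_model=256, target_dim=64):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(),
                                  nn.Linear(128, target_dim))

    def forward(self, force_emb):           # force_emb: (B, d_model)
        return self.head(force_emb)         # (B, target_dim)

def virtual_target_loss(head, force_emb, action_emb):
    """L2 alignment between the predicted virtual target and the action embedding.
    Detaching the action side is one possible choice that keeps the gradient on the force path."""
    return F.mse_loss(head(force_emb), action_emb.detach())
```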
5. Training
- The whole system is trained jointly on collected demonstrations (vision + force) with three losses: diffusion reconstruction, force‑to‑action regularization, and a small KL term for latent stability (combined as sketched below).
- No separate fine‑tuning of a force controller is required.
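Putting the three terms together, the joint objective might look like the sketch below; the weights are placeholders, not values reported in the paper.

```python
def total_loss(diffusion_loss, vt_loss, kl_loss, w_vt=0.1, w_kl=0.01):
    """Diffusion reconstruction + force-to-action regularization + latent-stability KL."""
    return diffusion_loss + w_vt * vt_loss + w_kl * kl_loss
```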
Results & Findings
Success rates on the benchmark tasks:
| Task | ImplicitRDP | Vision‑Only Baseline | Hierarchical (Vision + Force) |
|---|---|---|---|
| Peg‑in‑hole (tight tolerance) | 92 % | 68 % | 81 % |
| Drawer opening (variable friction) | 88 % | 55 % | 73 % |
| Cable routing (dynamic obstacles) | 84 % | 60 % | 77 % |
- Reactivity: ImplicitRDP reacts to force spikes within ≈10 ms, an order of magnitude faster than the vision‑only planner (≈100 ms).
- Smoothness: The diffusion decoder yields low‑jerk trajectories, reducing wear on hardware.
- Ablation: Removing the virtual‑target regularizer drops success by ~10 % and causes the model to ignore force inputs; disabling causal attention leads to unstable force corrections.
Overall, the unified policy outperforms both single‑modality and staged approaches while simplifying the training pipeline.
Practical Implications
- Simplified Stack: Developers can replace a complex hierarchy (vision planner → force controller) with a single model, cutting down integration effort and latency.
- Plug‑and‑Play Sensors: The architecture works with any off‑the‑shelf RGB‑D camera and standard 6‑DoF force/torque sensor, making it suitable for existing robotic arms.
- Higher Throughput: Faster reactive loops mean shorter cycle times on assembly lines, especially for insertion, fastening, or surface-polishing tasks where contact dynamics dominate.
- Robustness to Variability: Because the policy learns to fuse global context with local contact cues, it adapts better to part tolerances, surface finishes, and unexpected disturbances—key for flexible manufacturing and service robots.
- Open‑Source Release: The authors state that code and demo videos will be released, enabling rapid prototyping and benchmarking in industry labs.
Limitations & Future Work
- Sensor Dependency: The method assumes synchronized, low‑latency force streaming; noisy or delayed force data could degrade performance.
- Scalability of Token Length: Very long visual histories increase transformer memory usage; the current implementation caps visual token windows at a few seconds.
- Generalization to New Tasks: While the model transfers well across the tested tasks, zero‑shot adaptation to entirely new contact dynamics (e.g., soft‑object manipulation) remains an open question.
- Future Directions: The authors suggest exploring multi‑modal extensions (e.g., tactile arrays), hierarchical diffusion for longer horizons, and curriculum‑based data collection to further improve robustness.
Authors
- Wendi Chen
- Han Xue
- Yi Wang
- Fangyuan Zhou
- Jun Lv
- Yang Jin
- Shirun Tang
- Chuan Wen
- Cewu Lu
Paper Information
- arXiv ID: 2512.10946v1
- Categories: cs.RO, cs.AI, cs.LG
- Published: December 11, 2025