[Paper] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Source: arXiv - 2602.06035v1
Overview
The paper introduces InterPrior, a new framework that teaches a generative controller to produce physically plausible whole‑body motions for humans interacting with objects. By combining large‑scale imitation learning with reinforcement‑learning fine‑tuning, the authors create a motion prior that can handle a wide variety of loco‑manipulation tasks—think picking up a cup, opening a door, or balancing on a moving platform—while staying grounded in physics.
Key Contributions
- Unified generative controller that learns from massive motion‑capture datasets and can be conditioned on high‑level intents (e.g., “grab”, “push”, “walk”).
- Goal‑conditioned variational policy that consumes multimodal observations (poses, contacts, object states) together with high‑level commands, reconstructing expert motions through a learned latent skill space.
- Physical data augmentation (perturbations, force injections) to expose the model to out‑of‑distribution situations during pre‑training.
- Reinforcement‑learning fine‑tuning that refines the distilled policy, improving robustness to unseen goals and initial states.
- Demonstrations of interactive control (real‑time user steering) and transfer to real robots, showing the model’s practical viability.
Methodology
Imitation Pre‑training
- Collect a large dataset of human‑object interaction clips (e.g., motion‑capture recordings of people walking while carrying items).
- Train a full‑reference expert (a high‑capacity model that sees the entire future trajectory) to imitate these clips.
- Distill this expert into a goal‑conditioned variational policy that only receives the current observation and a high‑level intent, learning a latent “skill space” that can reconstruct the original motions.
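The distillation step above can be sketched as a conditional‑VAE objective: the policy encodes the observation and intent into a latent skill, then decodes an action that should match the full‑reference expert's action. The network shapes, the linear stand‑in layers, and the loss weights below are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of distilling a full-reference expert into a
# goal-conditioned variational policy (hypothetical shapes/names).
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, GOAL_DIM, LATENT_DIM, ACT_DIM = 8, 4, 2, 6

# Randomly initialized linear "networks" stand in for the real encoder/decoder.
W_enc = rng.normal(size=(OBS_DIM + GOAL_DIM, 2 * LATENT_DIM)) * 0.1
W_dec = rng.normal(size=(OBS_DIM + LATENT_DIM, ACT_DIM)) * 0.1

def encode(obs, goal):
    h = np.concatenate([obs, goal]) @ W_enc
    mu, logvar = h[:LATENT_DIM], h[LATENT_DIM:]
    return mu, logvar

def reparameterize(mu, logvar):
    # Sample z ~ N(mu, sigma^2) via the reparameterization trick.
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def decode(obs, z):
    return np.concatenate([obs, z]) @ W_dec

def distill_loss(obs, goal, expert_action, beta=1e-3):
    mu, logvar = encode(obs, goal)
    z = reparameterize(mu, logvar)
    action = decode(obs, z)
    recon = np.mean((action - expert_action) ** 2)            # imitate the expert
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # keep latent regular
    return recon + beta * kl

obs = rng.normal(size=OBS_DIM)
goal = rng.normal(size=GOAL_DIM)
expert_action = rng.normal(size=ACT_DIM)
loss = distill_loss(obs, goal, expert_action)
print(round(float(loss), 4))
```

Minimizing the reconstruction term makes the student imitate the expert; the small KL term keeps the latent skill space well‑structured so it can be sampled at inference time.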
Physical Perturbation Augmentation
- Randomly apply forces, change object masses, or jitter joint positions during training.
- This forces the policy to learn how to recover from physically unrealistic states, expanding the reachable latent manifold.
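A rough sketch of this augmentation loop, assuming a simulator state container and perturbation ranges of my own choosing (the paper does not specify these values):

```python
# Hedged sketch of physical data augmentation: random pushes, object-mass
# re-scaling, and joint jitter applied to a simulator state. `SimState`
# and all ranges are illustrative, not taken from the paper.
import random
from dataclasses import dataclass

@dataclass
class SimState:
    joint_pos: list                           # joint angles (rad)
    object_mass: float                        # kg
    external_force: tuple = (0.0, 0.0, 0.0)  # N, e.g. applied at the pelvis

def perturb(state, rng, push_scale=50.0, mass_range=(0.5, 2.0), jitter=0.02):
    # 1) Random push: a force vector injected for a few sim steps.
    force = tuple(rng.uniform(-push_scale, push_scale) for _ in range(3))
    # 2) Object-mass randomization within a multiplicative range.
    mass = state.object_mass * rng.uniform(*mass_range)
    # 3) Joint jitter: small noise on each joint angle.
    joints = [q + rng.uniform(-jitter, jitter) for q in state.joint_pos]
    return SimState(joint_pos=joints, object_mass=mass, external_force=force)

rng = random.Random(42)
state = SimState(joint_pos=[0.0, 0.3, -0.2], object_mass=1.0)
aug = perturb(state, rng)
print(aug.object_mass)
```

Applying `perturb` to states sampled during pre‑training exposes the policy to recoverable off‑nominal situations it would never see in clean motion‑capture data.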
Reinforcement‑Learning Fine‑tuning
- Define a reward that penalizes physics violations (e.g., interpenetration, loss of balance) and encourages task completion (e.g., reaching the target object).
- Use RL (e.g., PPO) to adjust the policy parameters, improving performance on unseen goals and novel object configurations.
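The reward structure described above can be sketched as a weighted sum of task progress and physics penalties. The feature names and weights are assumptions for illustration; the paper's exact reward terms may differ.

```python
# Illustrative fine-tuning reward combining task progress with physics
# penalties (interpenetration, balance, falling). Weights are assumed.
def interaction_reward(dist_to_target, interpenetration_depth,
                       com_offset, fell_over,
                       w_task=1.0, w_pen=5.0, w_balance=0.5):
    task = w_task * (-dist_to_target)          # closer to the goal is better
    physics = -w_pen * interpenetration_depth  # penalize object overlap
    balance = -w_balance * com_offset          # penalize CoM drift from support
    terminal = -10.0 if fell_over else 0.0     # large penalty on falling
    return task + physics + balance + terminal

print(interaction_reward(0.2, 0.0, 0.05, False))
```

An on‑policy algorithm such as PPO then maximizes the expected discounted sum of this reward while staying close to the distilled policy.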
Inference & Interaction
- At runtime, a developer supplies a high‑level command (e.g., “pick up the red box”) and optional constraints (desired hand position).
- The policy samples from its latent space to generate a full‑body trajectory that respects physics and the user’s intent.
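At inference time the loop looks roughly like the following sketch, where `policy_step`, the command vocabulary, and the toy state update are all placeholders for the learned model:

```python
# Sketch of runtime use: sample a latent skill conditioned on a
# high-level command, then roll the policy forward. Everything here
# is a placeholder for the learned decoder and simulator.
import numpy as np

rng = np.random.default_rng(7)
COMMANDS = {"pick_up": 0, "push": 1, "walk": 2}  # illustrative intent ids

def policy_step(obs, intent_id, z):
    # Stand-in for the learned decoder: obs + intent + latent -> joint targets.
    return np.tanh(obs.mean() + 0.1 * intent_id + z)

def rollout(obs, command, steps=5):
    intent = COMMANDS[command]
    traj = []
    for _ in range(steps):
        z = rng.normal(size=obs.shape)  # sample a skill from the latent prior
        action = policy_step(obs, intent, z)
        obs = obs + 0.01 * action       # toy state update (a simulator in practice)
        traj.append(action)
    return traj

traj = rollout(np.zeros(3), "pick_up")
print(len(traj))  # 5 steps of joint targets
```

Sampling a fresh latent each step is one simple choice; holding a latent fixed for longer horizons would yield more temporally coherent skills.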
Results & Findings
- Generalization: The fine‑tuned policy successfully handled objects and poses never seen during imitation training, outperforming baseline models that lacked RL fine‑tuning.
- Physical Coherence: Quantitative metrics (e.g., center‑of‑mass stability, contact forces) showed a 30 % reduction in balance violations compared to a purely imitation‑based model.
- Interactive Control: Real‑time user steering experiments demonstrated smooth transitions between intents without noticeable jitter or foot‑slipping.
- Robot Transfer: When deployed on a humanoid robot platform, the controller generated feasible joint commands that respected the robot’s torque limits, enabling tasks like “push a chair” and “lift a box” with minimal additional tuning.
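One way to operationalize a balance‑violation count like the one reported above is to flag frames where the ground projection of the center of mass leaves the foot support region. The axis‑aligned bounding box used here is a crude approximation of the support polygon and is not the paper's actual metric.

```python
# Hedged sketch of a balance-violation metric: count frames where the
# CoM ground projection leaves a bounding box around the foot contacts.
def balance_violations(com_xy_per_frame, foot_xy_per_frame, margin=0.02):
    violations = 0
    for (cx, cy), feet in zip(com_xy_per_frame, foot_xy_per_frame):
        xs = [x for x, _ in feet]
        ys = [y for _, y in feet]
        inside = (min(xs) - margin <= cx <= max(xs) + margin and
                  min(ys) - margin <= cy <= max(ys) + margin)
        violations += 0 if inside else 1
    return violations

feet = [[(0.0, 0.0), (0.2, 0.0), (0.0, 0.1), (0.2, 0.1)]] * 3
com = [(0.1, 0.05), (0.1, 0.05), (0.5, 0.05)]  # last frame drifts outside
print(balance_violations(com, feet))  # 1 frame outside the support box
```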
Practical Implications
- Game & VR Development: InterPrior can serve as a plug‑and‑play motion prior for avatars that need to interact with dynamic environments, reducing the need for hand‑crafted animation blends.
- Robotics: Humanoid robots can leverage the learned prior to quickly acquire new manipulation skills without exhaustive task‑specific programming—useful for service robots in homes or warehouses.
- Simulation‑Based Training: Autonomous driving or crowd‑simulation pipelines can inject realistic human‑object interactions, improving safety validation and scenario diversity.
- Human‑Centric AI Assistants: Virtual assistants that need to demonstrate or predict human actions (e.g., AR coaching apps) can use the model to generate plausible whole‑body demonstrations on the fly.
Limitations & Future Work
- Dataset Bias: The model’s performance hinges on the diversity of the imitation dataset; rare or highly specialized interactions may still be under‑represented.
- Computational Cost: Real‑time inference on high‑DOF humanoids requires GPU acceleration, which may be a bottleneck for edge devices.
- Fine‑Grained Dexterity: While the framework handles gross loco‑manipulation well, fine hand‑finger manipulation (e.g., typing) remains outside its current scope.
- Future Directions: The authors suggest scaling to multi‑agent scenarios, integrating vision‑based perception for on‑the‑fly object detection, and exploring more sample‑efficient RL fine‑tuning methods.
Authors
- Sirui Xu
- Samuel Schulter
- Morteza Ziyadi
- Xialin He
- Xiaohan Fei
- Yu‑Xiong Wang
- Liangyan Gui
Paper Information
- arXiv ID: 2602.06035v1
- Categories: cs.CV, cs.GR, cs.RO
- Published: February 5, 2026