[Paper] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions

Published: February 5, 2026, 1:59 PM EST
Source: arXiv - 2602.06035v1

Overview

The paper introduces InterPrior, a new framework that teaches a generative controller to produce physically plausible whole‑body motions for humans interacting with objects. By combining large‑scale imitation learning with reinforcement‑learning fine‑tuning, the authors create a motion prior that can handle a wide variety of loco‑manipulation tasks—think picking up a cup, opening a door, or balancing on a moving platform—while staying grounded in physics.

Key Contributions

  • Unified generative controller that learns from massive motion‑capture datasets and can be conditioned on high‑level intents (e.g., “grab”, “push”, “walk”).
  • Goal‑conditioned variational policy that encodes multimodal observations (poses, contacts, object states) and high‑level commands into a latent skill space from which the demonstrated motions can be reconstructed.
  • Physical data augmentation (perturbations, force injections) to expose the model to out‑of‑distribution situations during pre‑training.
  • Reinforcement‑learning fine‑tuning that refines the distilled policy, improving robustness to unseen goals and initial states.
  • Demonstrations of interactive control (real‑time user steering) and transfer to real robots, showing the model’s practical viability.

Methodology

  1. Imitation Pre‑training

    • Collect a large dataset of human‑object interaction clips (e.g., motion‑capture recordings of people walking while carrying items).
    • Train a full‑reference expert (a high‑capacity model that sees the entire future trajectory) to imitate these clips.
    • Distill this expert into a goal‑conditioned variational policy that only receives the current observation and a high‑level intent, learning a latent “skill space” that can reconstruct the original motions.
  2. Physical Perturbation Augmentation

    • Randomly apply forces, change object masses, or jitter joint positions during training.
    • This teaches the policy to recover from perturbed, out‑of‑distribution states, expanding the reachable region of the latent skill manifold.
  3. Reinforcement‑Learning Fine‑tuning

    • Define a reward that penalizes physics violations (e.g., interpenetration, loss of balance) and encourages task completion (e.g., reaching the target object).
    • Use RL (e.g., PPO) to adjust the policy parameters, improving performance on unseen goals and novel object configurations.
  4. Inference & Interaction

    • At runtime, a developer supplies a high‑level command (e.g., “pick up the red box”) and optional constraints (e.g., a desired hand position).
    • The policy samples from its latent space to generate a full‑body trajectory that respects physics and the user’s intent.
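The distillation in step 1 can be sketched as a conditional‑VAE objective: reconstruct the expert's action through a Gaussian latent "skill" variable regularized toward a unit prior. The network shapes, dimensions, and the `beta` weight below are illustrative assumptions, not values from the paper; linear maps stand in for the actual encoder/decoder networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear "networks" standing in for the encoder and decoder (illustrative sizes).
D_OBS, D_GOAL, D_Z, D_ACT = 8, 4, 2, 6
We = rng.standard_normal((D_OBS + D_GOAL, 2 * D_Z)) * 0.1
Wd = rng.standard_normal((D_OBS + D_GOAL + D_Z, D_ACT)) * 0.1

def encode(obs, goal):
    h = np.concatenate([obs, goal]) @ We
    return h[:D_Z], h[D_Z:]          # mean and log-variance of the skill latent

def decode(obs, goal, z):
    return np.concatenate([obs, goal, z]) @ Wd

def vae_policy_loss(obs, goal, expert_action, beta=0.1):
    """CVAE-style distillation objective: reconstruct the expert's action
    through a Gaussian latent skill, regularized toward a unit prior."""
    mu, log_var = encode(obs, goal)
    z = mu + np.exp(0.5 * log_var) * rng.standard_normal(D_Z)   # reparameterize
    recon = np.mean((decode(obs, goal, z) - expert_action) ** 2)
    kl = -0.5 * np.mean(1 + log_var - mu**2 - np.exp(log_var))
    return recon + beta * kl

loss = vae_policy_loss(rng.standard_normal(D_OBS),
                       rng.standard_normal(D_GOAL),
                       rng.standard_normal(D_ACT))
```

At inference (step 4), the same decoder is driven by a latent sampled from the prior instead of the posterior, which is what lets the policy act without seeing the full future trajectory.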
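Step 2's physical data augmentation amounts to randomizing simulator parameters before each training episode. A minimal sketch, assuming a dict-based parameter interface; the parameter names and perturbation ranges here are hypothetical, since the paper does not specify exact values:

```python
import numpy as np

def perturb_episode(sim_params, rng, force_scale=50.0, mass_jitter=0.3,
                    joint_noise=0.02):
    """Randomize object mass, inject a random external force, and jitter
    joint positions before a training episode (illustrative ranges)."""
    p = dict(sim_params)  # copy so the base configuration stays untouched
    p["object_mass"] = p["object_mass"] * rng.uniform(1 - mass_jitter,
                                                      1 + mass_jitter)
    p["external_force"] = rng.normal(0.0, force_scale, size=3)  # newtons, world frame
    p["joint_pos"] = p["joint_pos"] + rng.normal(0.0, joint_noise,
                                                 size=p["joint_pos"].shape)
    return p

rng = np.random.default_rng(42)
base = {"object_mass": 2.0, "joint_pos": np.zeros(23)}  # 23-DoF humanoid, illustrative
episode = perturb_episode(base, rng)
```

Sampling a fresh perturbation per episode is what exposes the policy to states it would never visit under clean imitation rollouts.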
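The reward in step 3 combines a task‑completion term with physics‑violation penalties. A toy version under assumed weights (`w_task`, `w_pen`, `w_bal` are illustrative, not taken from the paper):

```python
import numpy as np

def finetune_reward(hand_pos, target_pos, penetration_depth, com_offset,
                    w_task=1.0, w_pen=5.0, w_bal=2.0):
    """Reward = task-completion term minus physics-violation penalties."""
    # Reaching term in (0, 1]: 1.0 exactly when the hand is on the target.
    task = np.exp(-np.linalg.norm(np.asarray(hand_pos) - np.asarray(target_pos)))
    return (w_task * task
            - w_pen * penetration_depth              # metres of interpenetration
            - w_bal * np.linalg.norm(com_offset))    # CoM drift from support centre

# Ideal state: hand on target, no penetration, balanced -> reward 1.0.
best = finetune_reward([0.3, 0.0, 1.0], [0.3, 0.0, 1.0], 0.0, [0.0, 0.0])
worse = finetune_reward([0.3, 0.0, 1.0], [0.5, 0.0, 1.0], 0.02, [0.05, 0.0])
```

Any on‑policy algorithm such as PPO (the example the summary mentions) can then maximize this reward starting from the distilled policy's weights.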

Results & Findings

  • Generalization: The fine‑tuned policy successfully handled objects and poses never seen during imitation training, outperforming baseline models that lacked RL fine‑tuning.
  • Physical Coherence: Quantitative metrics (e.g., center‑of‑mass stability, contact forces) showed a 30 % reduction in balance violations compared to a purely imitation‑based model.
  • Interactive Control: Real‑time user steering experiments demonstrated smooth transitions between intents without noticeable jitter or foot‑slipping.
  • Robot Transfer: When deployed on a humanoid robot platform, the controller generated feasible joint commands that respected the robot’s torque limits, enabling tasks like “push a chair” and “lift a box” with minimal additional tuning.

Practical Implications

  • Game & VR Development: InterPrior can serve as a plug‑and‑play motion prior for avatars that need to interact with dynamic environments, reducing the need for hand‑crafted animation blends.
  • Robotics: Humanoid robots can leverage the learned prior to quickly acquire new manipulation skills without exhaustive task‑specific programming—useful for service robots in homes or warehouses.
  • Simulation‑Based Training: Autonomous driving or crowd‑simulation pipelines can inject realistic human‑object interactions, improving safety validation and scenario diversity.
  • Human‑Centric AI Assistants: Virtual assistants that need to demonstrate or predict human actions (e.g., AR coaching apps) can use the model to generate plausible whole‑body demonstrations on the fly.

Limitations & Future Work

  • Dataset Bias: The model’s performance hinges on the diversity of the imitation dataset; rare or highly specialized interactions may still be under‑represented.
  • Computational Cost: Real‑time inference on high‑DOF humanoids requires GPU acceleration, which may be a bottleneck for edge devices.
  • Fine‑Grained Dexterity: While the framework handles gross loco‑manipulation well, fine hand‑finger manipulation (e.g., typing) remains outside its current scope.
  • Future Directions: The authors suggest scaling to multi‑agent scenarios, integrating vision‑based perception for on‑the‑fly object detection, and exploring more sample‑efficient RL fine‑tuning methods.

Authors

  • Sirui Xu
  • Samuel Schulter
  • Morteza Ziyadi
  • Xialin He
  • Xiaohan Fei
  • Yu‑Xiong Wang
  • Liangyan Gui

Paper Information

  • arXiv ID: 2602.06035v1
  • Categories: cs.CV, cs.GR, cs.RO
  • Published: February 5, 2026