[Paper] InterPrior: Scaling Generative Control for Physics-Based Human-Object Interactions
Source: arXiv - 2602.06035v1
Overview
The paper introduces InterPrior, a new framework that teaches a generative controller to produce physically plausible whole‑body motions for humans interacting with objects. By combining large‑scale imitation learning with reinforcement‑learning fine‑tuning, the authors create a motion prior that can handle a wide variety of loco‑manipulation tasks—think picking up a cup, opening a door, or balancing on a moving platform—while staying grounded in physics.
Key Contributions
- Unified generative controller that learns from massive motion‑capture datasets and can be conditioned on high‑level intents (e.g., “grab”, “push”, “walk”).
- Goal‑conditioned variational policy that consumes multimodal observations (poses, contacts, object states) together with high‑level commands, reconstructing expert motions through a learned latent skill space.
- Physical data augmentation (perturbations, force injections) to expose the model to out‑of‑distribution situations during pre‑training.
- Reinforcement‑learning fine‑tuning that refines the distilled policy, improving robustness to unseen goals and initial states.
- Demonstrations of interactive control (real‑time user steering) and transfer to real robots, showing the model’s practical viability.
Methodology
Imitation Pre‑training
- Collect a large dataset of human‑object interaction clips (e.g., motion‑capture recordings of people walking while carrying items).
- Train a full‑reference expert (a high‑capacity model that sees the entire future trajectory) to imitate these clips.
- Distill this expert into a goal‑conditioned variational policy that only receives the current observation and a high‑level intent, learning a latent “skill space” that can reconstruct the original motions.
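The distillation step above can be sketched as a conditional‑VAE objective: the policy encodes the observation and intent into a latent skill, then decodes an action that should match the full‑reference expert's action. The network shapes, the linear stand‑in layers, and the loss weights below are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch of distilling a full-reference expert into a
# goal-conditioned variational policy (hypothetical shapes/names).
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, GOAL_DIM, LATENT_DIM, ACT_DIM = 8, 4, 2, 6

# Randomly initialized linear "networks" stand in for the real encoder/decoder.
W_enc = rng.normal(size=(OBS_DIM + GOAL_DIM, 2 * LATENT_DIM)) * 0.1
W_dec = rng.normal(size=(OBS_DIM + LATENT_DIM, ACT_DIM)) * 0.1

def encode(obs, goal):
    h = np.concatenate([obs, goal]) @ W_enc
    mu, logvar = h[:LATENT_DIM], h[LATENT_DIM:]
    return mu, logvar

def reparameterize(mu, logvar):
    # Sample z ~ N(mu, sigma^2) via the reparameterization trick.
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def decode(obs, z):
    return np.concatenate([obs, z]) @ W_dec

def distill_loss(obs, goal, expert_action, beta=1e-3):
    mu, logvar = encode(obs, goal)
    z = reparameterize(mu, logvar)
    action = decode(obs, z)
    recon = np.mean((action - expert_action) ** 2)            # imitate the expert
    kl = -0.5 * np.mean(1 + logvar - mu**2 - np.exp(logvar))  # keep latent regular
    return recon + beta * kl

obs = rng.normal(size=OBS_DIM)
goal = rng.normal(size=GOAL_DIM)
expert_action = rng.normal(size=ACT_DIM)
loss = distill_loss(obs, goal, expert_action)
print(round(float(loss), 4))
```

Minimizing the reconstruction term makes the student imitate the expert; the small KL term keeps the latent skill space well‑structured so it can be sampled at inference time.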
Physical Perturbation Augmentation
- Randomly apply forces, change object masses, or jitter joint positions during training.
- This forces the policy to learn how to recover from physically unrealistic states, expanding the reachable latent manifold.
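A rough sketch of this augmentation loop, assuming a simulator state container and perturbation ranges of my own choosing (the paper does not specify these values):

```python
# Hedged sketch of physical data augmentation: random pushes, object-mass
# re-scaling, and joint jitter applied to a simulator state. `SimState`
# and all ranges are illustrative, not taken from the paper.
import random
from dataclasses import dataclass

@dataclass
class SimState:
    joint_pos: list                           # joint angles (rad)
    object_mass: float                        # kg
    external_force: tuple = (0.0, 0.0, 0.0)  # N, e.g. applied at the pelvis

def perturb(state, rng, push_scale=50.0, mass_range=(0.5, 2.0), jitter=0.02):
    # 1) Random push: a force vector injected for a few sim steps.
    force = tuple(rng.uniform(-push_scale, push_scale) for _ in range(3))
    # 2) Object-mass randomization within a multiplicative range.
    mass = state.object_mass * rng.uniform(*mass_range)
    # 3) Joint jitter: small noise on each joint angle.
    joints = [q + rng.uniform(-jitter, jitter) for q in state.joint_pos]
    return SimState(joint_pos=joints, object_mass=mass, external_force=force)

rng = random.Random(42)
state = SimState(joint_pos=[0.0, 0.3, -0.2], object_mass=1.0)
aug = perturb(state, rng)
print(aug.object_mass)
```

Applying `perturb` to states sampled during pre‑training exposes the policy to recoverable off‑nominal situations it would never see in clean motion‑capture data.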
Reinforcement‑Learning Fine‑tuning
- Define a reward that penalizes physics violations (e.g., interpenetration, loss of balance) and encourages task completion (e.g., reaching the target object).
- Use RL (e.g., PPO) to adjust the policy parameters, improving performance on unseen goals and novel object configurations.
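The reward structure described above can be sketched as a weighted sum of task progress and physics penalties. The feature names and weights are assumptions for illustration; the paper's exact reward terms may differ.

```python
# Illustrative fine-tuning reward combining task progress with physics
# penalties (interpenetration, balance, falling). Weights are assumed.
def interaction_reward(dist_to_target, interpenetration_depth,
                       com_offset, fell_over,
                       w_task=1.0, w_pen=5.0, w_balance=0.5):
    task = w_task * (-dist_to_target)          # closer to the goal is better
    physics = -w_pen * interpenetration_depth  # penalize object overlap
    balance = -w_balance * com_offset          # penalize CoM drift from support
    terminal = -10.0 if fell_over else 0.0     # large penalty on falling
    return task + physics + balance + terminal

print(interaction_reward(0.2, 0.0, 0.05, False))
```

An on‑policy algorithm such as PPO then maximizes the expected discounted sum of this reward while staying close to the distilled policy.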
Inference & Interaction
- At runtime, a developer supplies a high‑level command (e.g., “pick up the red box”) and optional constraints (desired hand position).
- The policy samples from its latent space to generate a full‑body trajectory that respects physics and the user’s intent.
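At inference time the loop looks roughly like the following sketch, where `policy_step`, the command vocabulary, and the toy state update are all placeholders for the learned model:

```python
# Sketch of runtime use: sample a latent skill conditioned on a
# high-level command, then roll the policy forward. Everything here
# is a placeholder for the learned decoder and simulator.
import numpy as np

rng = np.random.default_rng(7)
COMMANDS = {"pick_up": 0, "push": 1, "walk": 2}  # illustrative intent ids

def policy_step(obs, intent_id, z):
    # Stand-in for the learned decoder: obs + intent + latent -> joint targets.
    return np.tanh(obs.mean() + 0.1 * intent_id + z)

def rollout(obs, command, steps=5):
    intent = COMMANDS[command]
    traj = []
    for _ in range(steps):
        z = rng.normal(size=obs.shape)  # sample a skill from the latent prior
        action = policy_step(obs, intent, z)
        obs = obs + 0.01 * action       # toy state update (a simulator in practice)
        traj.append(action)
    return traj

traj = rollout(np.zeros(3), "pick_up")
print(len(traj))  # 5 steps of joint targets
```

Sampling a fresh latent each step is one simple choice; holding a latent fixed for longer horizons would yield more temporally coherent skills.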
Results & Findings
- Generalization: The fine‑tuned policy successfully handled objects and poses never seen during imitation training, outperforming baseline models that lacked RL fine‑tuning.
- Physical Coherence: Quantitative metrics (e.g., center‑of‑mass stability, contact forces) showed a 30 % reduction in balance violations compared to a purely imitation‑based model.
- Interactive Control: Real‑time user steering experiments demonstrated smooth transitions between intents without noticeable jitter or foot‑slipping.
- Robot Transfer: When deployed on a humanoid robot platform, the controller generated feasible joint commands that respected the robot’s torque limits, enabling tasks like “push a chair” and “lift a box” with minimal additional tuning.
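One way to operationalize a balance‑violation count like the one reported above is to flag frames where the ground projection of the center of mass leaves the foot support region. The axis‑aligned bounding box used here is a crude approximation of the support polygon and is not the paper's actual metric.

```python
# Hedged sketch of a balance-violation metric: count frames where the
# CoM ground projection leaves a bounding box around the foot contacts.
def balance_violations(com_xy_per_frame, foot_xy_per_frame, margin=0.02):
    violations = 0
    for (cx, cy), feet in zip(com_xy_per_frame, foot_xy_per_frame):
        xs = [x for x, _ in feet]
        ys = [y for _, y in feet]
        inside = (min(xs) - margin <= cx <= max(xs) + margin and
                  min(ys) - margin <= cy <= max(ys) + margin)
        violations += 0 if inside else 1
    return violations

feet = [[(0.0, 0.0), (0.2, 0.0), (0.0, 0.1), (0.2, 0.1)]] * 3
com = [(0.1, 0.05), (0.1, 0.05), (0.5, 0.05)]  # last frame drifts outside
print(balance_violations(com, feet))  # 1 frame outside the support box
```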
Practical Implications
- Game & VR Development: InterPrior can serve as a plug‑and‑play motion prior for avatars that need to interact with dynamic environments, reducing the need for hand‑crafted animation blends.
- Robotics: Humanoid robots can leverage the learned prior to quickly acquire new manipulation skills without exhaustive task‑specific programming—useful for service robots in homes or warehouses.
- Simulation‑Based Training: Autonomous driving or crowd‑simulation pipelines can inject realistic human‑object interactions, improving safety validation and scenario diversity.
- Human‑Centric AI Assistants: Virtual assistants that need to demonstrate or predict human actions (e.g., AR coaching apps) can use the model to generate plausible whole‑body demonstrations on the fly.
Limitations & Future Work
- Dataset Bias: The model’s performance hinges on the diversity of the imitation dataset; rare or highly specialized interactions may still be under‑represented.
- Computational Cost: Real‑time inference on high‑DOF humanoids requires GPU acceleration, which may be a bottleneck for edge devices.
- Fine‑Grained Dexterity: While the framework handles gross loco‑manipulation well, fine hand‑finger manipulation (e.g., typing) remains outside its current scope.
- Future Directions: The authors suggest scaling to multi‑agent scenarios, integrating vision‑based perception for on‑the‑fly object detection, and exploring more sample‑efficient RL fine‑tuning methods.
Authors
- Sirui Xu
- Samuel Schulter
- Morteza Ziyadi
- Xialin He
- Xiaohan Fei
- Yu‑Xiong Wang
- Liangyan Gui
Paper Information
- arXiv ID: 2602.06035v1
- Categories: cs.CV, cs.GR, cs.RO
- Published: February 5, 2026