[Paper] Closing the Train-Test Gap in World Models for Gradient-Based Planning

Published: December 10, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.09929v1

Overview

World‑model‑based reinforcement learning promises to train a single dynamics predictor offline on large expert datasets and then reuse it for many downstream planning tasks. While gradient‑based planners are fast, they have historically under‑performed classic gradient‑free methods such as the Cross‑Entropy Method (CEM). This paper traces that shortfall to a mismatch between how the model is trained (next‑state prediction) and how it is used at test time (action‑sequence optimization), and proposes concrete fixes that close the gap, delivering gradient‑based planning that is both faster and competitive in accuracy.
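
To see the gap concretely, the two objectives can be written side by side (notation is assumed here, following the summary's description rather than the paper's exact formulation):

```latex
% Train time: next-state prediction on an expert dataset D
\min_\theta \; \mathbb{E}_{(s_t, a_t, s_{t+1}) \sim D} \, \big\| f_\theta(s_t, a_t) - s_{t+1} \big\|^2

% Test time: action-sequence optimization through the frozen model
\max_{a_{0:H}} \; \sum_{t=0}^{H} r(\hat{s}_t),
\qquad \hat{s}_{t+1} = f_\theta(\hat{s}_t, a_t), \quad \hat{s}_0 = s_0
```

At test time the planner only ever evaluates the model on its own predicted states, a distribution the expert dataset never covers; that is exactly the train-test gap the paper targets.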

Key Contributions

  • Train‑test gap analysis: Formalizes the discrepancy between next‑state prediction training objectives and the action‑optimization use case at inference.
  • Data synthesis tricks: Introduces simple, model‑agnostic augmentations (e.g., imagined rollouts, action‑perturbation sampling) that expose the world model to the kinds of trajectories it will see during planning.
  • Improved gradient‑based planner: Shows that, with the synthesized data, standard gradient descent on the action sequence matches or exceeds CEM performance while using only ~10 % of the computational budget.
  • Broad empirical validation: Benchmarks on diverse manipulation (e.g., block stacking) and navigation (e.g., maze) environments, demonstrating consistent gains across tasks.
  • Open‑source implementation: Provides code and pretrained models, lowering the barrier for practitioners to adopt the technique.

Methodology

  1. Baseline world model: Train a neural dynamics model f_θ(s_t, a_t) → s_{t+1} on a large corpus of expert trajectories, using the usual mean‑squared error on next‑state predictions.
  2. Identify the gap: At test time, planners treat the model as a differentiable simulator and back‑propagate a loss defined over predicted future reward to update a candidate action sequence a_{0:H}. The model, however, has never been exposed to the distribution of states generated by its own imperfect predictions.
  3. Train‑time data synthesis (a code sketch follows this list):
    • Imagined rollouts: Starting from real states, roll the current model forward using randomly sampled actions to generate synthetic trajectories.
    • Action‑perturbation replay: Add noise to expert actions and re‑simulate, encouraging the model to be robust to off‑policy actions.
    • Reward‑aware sampling: Weight synthetic samples by estimated future reward, biasing the model toward regions it will later explore during planning.
  4. Joint training: Mix real expert data with the synthesized samples and continue training the dynamics model. No extra loss terms are needed; the same next‑state prediction objective is applied to both data sources.
  5. Gradient‑based planning: At inference, initialize a random action sequence, compute the predicted trajectory using the trained world model, evaluate a task‑specific reward, and back‑propagate the reward gradient to refine the actions (e.g., using Adam); a sketch of this planning loop appears after the data‑synthesis example below.
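
To illustrate steps 3–4, here is a minimal PyTorch-style sketch that combines the two synthesis tricks (imagined rollouts and action‑perturbation replay) into one routine and mixes the result with expert data under the same next‑state MSE objective. All names (WorldModel, synthesize_data, simulator_step) are illustrative assumptions, not the authors' released code; in particular, the ground‑truth `simulator_step` reflects the "re‑simulate" wording above.

```python
import torch
import torch.nn as nn


class WorldModel(nn.Module):
    """Illustrative dynamics model f_theta(s_t, a_t) -> s_{t+1}."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))


def synthesize_data(model, simulator_step, real_states, expert_actions,
                    horizon: int = 5, noise_std: float = 0.1):
    """Steps 3a/3b: imagined rollouts under perturbed expert actions.

    `expert_actions` has shape (horizon, batch, action_dim);
    `simulator_step(s, a) -> s_next` is an assumed ground-truth simulator
    that supplies labels for the perturbed, off-distribution transitions.
    """
    s_model = real_states
    states, actions, targets = [], [], []
    for t in range(horizon):
        # Perturb expert actions so the model sees off-policy behaviour.
        a = expert_actions[t] + noise_std * torch.randn_like(expert_actions[t])
        states.append(s_model)
        actions.append(a)
        targets.append(simulator_step(s_model, a))    # ground-truth label
        with torch.no_grad():
            s_model = model(s_model, a)                # imagined next state
    return torch.cat(states), torch.cat(actions), torch.cat(targets)


def joint_training_step(model, optimizer, expert_batch, synth_batch):
    """Step 4: the same next-state MSE is applied to both data sources."""
    loss = sum(nn.functional.mse_loss(model(s, a), s_next)
               for s, a, s_next in (expert_batch, synth_batch))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```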

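A matching sketch of step 5 follows: the trained model is frozen and used as a differentiable simulator, and the action sequence itself becomes the optimization variable. `reward_fn` is a hypothetical stand-in for a task-specific, differentiable reward; Adam is used as the summary suggests.

```python
import torch


def plan_with_gradients(model, reward_fn, s0, horizon=20, action_dim=4,
                        steps=100, lr=0.05):
    """Step 5: gradient-based planning through the frozen world model.

    `s0` is a single start state of shape (state_dim,);
    `reward_fn(states, actions) -> scalar` is an assumed differentiable
    task reward. Only the action sequence a_{0:H} is optimized.
    """
    for p in model.parameters():              # keep the world model frozen
        p.requires_grad_(False)

    actions = 0.1 * torch.randn(horizon, action_dim)   # random init
    actions.requires_grad_(True)
    opt = torch.optim.Adam([actions], lr=lr)

    for _ in range(steps):
        s, states = s0, []
        for t in range(horizon):
            s = model(s, actions[t])          # differentiable rollout
            states.append(s)
        loss = -reward_fn(torch.stack(states), actions)  # maximize reward
        opt.zero_grad()
        loss.backward()
        opt.step()

    return actions.detach()
```

Compared with CEM, which scores whole populations of sampled action sequences at every iteration, this loop refines a single sequence by backpropagation, which is where the reported roughly 10× reduction in compute comes from.
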
Results & Findings

| Environment | Planner | Success Rate (↑) | Compute Time (↓, relative to CEM) |
| --- | --- | --- | --- |
| Block‑Stack (manipulation) | CEM (baseline) | 78 % | 1.0× (reference) |
| Block‑Stack (manipulation) | Gradient‑based (w/ synthesis) | 81 % | 0.1× |
| Maze‑Nav (navigation) | CEM | 92 % | 1.0× |
| Maze‑Nav (navigation) | Gradient‑based (w/ synthesis) | 93 % | 0.12× |
  • Models trained with the synthesized data close the performance gap: gradient‑based planners now match or slightly surpass CEM on all tested tasks.
  • Computational savings are dramatic—gradient descent converges in ~10 % of the iterations CEM needs, translating to lower latency and energy consumption.
  • Ablation studies confirm that each synthesis component (imagined rollouts, perturbations, reward‑aware sampling) contributes positively; removing any of them degrades both success rate and speed.

Practical Implications

  • Faster online planning: Robots or agents can re‑plan in milliseconds rather than seconds, enabling real‑time responsiveness in manipulation (e.g., pick‑and‑place on a moving conveyor) and autonomous navigation (e.g., drone obstacle avoidance).
  • Reduced hardware requirements: Gradient‑based planners rely on simple back‑propagation, which runs efficiently on commodity GPUs or even on‑device accelerators, unlike CEM’s massive parallel sampling.
  • Simplified pipelines: Developers can keep a single world‑model training loop and swap in the same model for many downstream tasks without retraining task‑specific policies.
  • Scalable to large datasets: Since the method only adds inexpensive synthetic rollouts, it scales well to massive offline datasets (e.g., logs from self‑driving fleets).
  • Potential for hybrid systems: The approach can be combined with model‑free fine‑tuning, giving a “best‑of‑both‑worlds” system where the world model provides a strong prior and gradient‑based planning handles rapid adaptation.

Limitations & Future Work

  • Model bias remains: The method mitigates but does not eliminate compounding errors in long‑horizon rollouts; extremely deep planning horizons may still suffer.
  • Task‑specific reward design: Gradient‑based planning still requires a differentiable reward signal; crafting such rewards for complex, sparse tasks can be non‑trivial.
  • Limited to deterministic dynamics: The current formulation assumes a deterministic world model; extending to stochastic or partially observable settings is an open challenge.
  • Future directions:
    • Incorporate uncertainty estimates (e.g., ensembles) to guide the synthesis process.
    • Explore curriculum‑style synthesis that gradually increases rollout length.
    • Test on higher‑dimensional perception‑rich domains (e.g., vision‑based manipulation) where state estimation adds another layer of difficulty.

Bottom line: By aligning the training data distribution with how world models are actually used at inference, this work unlocks the speed advantages of gradient‑based planning without sacrificing performance—an exciting step toward more agile, data‑efficient autonomous systems.

Authors

  • Arjun Parthasarathy
  • Nimit Kalra
  • Rohun Agrawal
  • Yann LeCun
  • Oumayma Bounou
  • Pavel Izmailov
  • Micah Goldblum

Paper Information

  • arXiv ID: 2512.09929v1
  • Categories: cs.LG, cs.RO
  • Published: December 10, 2025
  • PDF: Download PDF