[Paper] WorldCompass: Reinforcement Learning for Long-Horizon World Models

Published: February 9, 2026, 01:59 PM EST
4 min read
Source: arXiv


Overview

WorldCompass is a new reinforcement‑learning (RL) post‑training framework that upgrades long‑horizon, video‑based world models so they respond to user interactions more reliably over long horizons. By treating the world model as an autoregressive video generator and “steering” it with carefully designed rewards, the authors report noticeably higher interaction accuracy and visual quality, key steps toward more usable simulation and generative‑AI systems.

Key Contributions

  • Clip‑level rollout strategy – generates and evaluates many candidate video clips for a single target segment, dramatically improving rollout efficiency and delivering fine‑grained reward signals.
  • Dual‑objective reward design – combines an interaction‑following accuracy reward with a visual‑quality reward, providing direct supervision while curbing reward‑hacking.
  • Negative‑aware fine‑tuning RL algorithm – a lightweight RL update that penalizes undesirable generations and incorporates several efficiency tricks to keep training fast and memory‑friendly.
  • Demonstrated gains on WorldPlay – applying WorldCompass to the state‑of‑the‑art open‑source world model (WorldPlay) yields consistent boosts in both how well the model follows commands and the realism of the generated video.

Methodology

WorldCompass builds on top of an existing autoregressive video world model (e.g., WorldPlay). The workflow can be broken down into three intuitive steps:

  1. Clip‑level rollout – Instead of rolling out an entire long video frame‑by‑frame, the system samples a set of complete short clips (e.g., 2‑3 seconds) that start from the same context. Each clip is scored, allowing the RL loop to receive a dense, clip‑level reward rather than a sparse end‑of‑episode signal.
  2. Reward engineering
    • Interaction accuracy: measures how closely the generated clip follows the prescribed action sequence (e.g., “pick up the cup”).
    • Visual fidelity: uses perceptual metrics (e.g., LPIPS, frame‑level sharpness) to ensure the video remains realistic.
      The two rewards are summed with a weighting scheme that discourages the model from “gaming” one metric at the expense of the other.
  3. Negative‑aware fine‑tuning – A lightweight policy‑gradient update that explicitly penalizes clips with low visual quality or large interaction errors. The authors also integrate gradient caching, mixed‑precision training, and batch‑wise clip selection to keep the extra RL overhead modest.
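The clip‑level rollout and weighted reward combination (steps 1–2) can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the clip dictionaries, the `generate_clip` callable, and the 0.7/0.3 weights are all hypothetical stand‑ins, and the toy reward functions replace the paper's action‑following and perceptual (e.g., LPIPS) scorers.

```python
# Minimal sketch of clip-level rollout with a dual-objective reward.
# All names and numbers here are illustrative assumptions.

def interaction_reward(clip, target_actions):
    # Stand-in for the interaction-following score: fraction of the
    # prescribed actions that the generated clip realizes.
    followed = sum(1 for a in target_actions if a in clip["actions"])
    return followed / len(target_actions)

def visual_reward(clip):
    # Stand-in for a perceptual quality score in [0, 1]
    # (the paper uses metrics such as LPIPS).
    return clip["quality"]

def score_clip(clip, target_actions, w_interact=0.7, w_visual=0.3):
    # Weighted sum of the two rewards; balancing the weights is what
    # discourages gaming one metric at the other's expense.
    return (w_interact * interaction_reward(clip, target_actions)
            + w_visual * visual_reward(clip))

def clip_level_rollout(generate_clip, context, target_actions, num_candidates=8):
    # Sample several candidate clips from the same context and score
    # each one, yielding dense, clip-level reward signals instead of a
    # single sparse end-of-episode reward.
    candidates = [generate_clip(context) for _ in range(num_candidates)]
    return [(clip, score_clip(clip, target_actions)) for clip in candidates]
```

In practice the scored candidates would feed directly into the fine‑tuning update of step 3, with high‑reward clips reinforced and low‑reward clips penalized.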

All of this is performed after the base world model has already been trained, so developers can plug WorldCompass into any existing video‑generation pipeline without retraining from scratch.
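The negative‑aware update of step 3 is described only at a high level; the sketch below shows one plausible reading, in which each clip's reward is compared to a baseline and below‑baseline clips receive an explicitly negative weight. The group‑mean baseline and the `neg_scale` knob are assumptions for illustration, not details from the paper, and the efficiency tricks (gradient caching, mixed precision) are omitted.

```python
# Sketch of negative-aware weighting for the fine-tuning update.
# Baseline choice and neg_scale are assumptions, not the paper's algorithm.

def negative_aware_weights(rewards, baseline=None, neg_scale=1.0):
    # Advantage of each clip relative to a baseline; clips below the
    # baseline get an explicitly negative weight (optionally rescaled),
    # so poor generations are actively penalized rather than ignored.
    if baseline is None:
        baseline = sum(rewards) / len(rewards)  # group mean as baseline
    weights = []
    for r in rewards:
        adv = r - baseline
        weights.append(adv if adv >= 0 else neg_scale * adv)
    return weights
```

In a full RL loop, these per‑clip weights would scale each clip's log‑likelihood gradient under the video generator, pushing probability mass toward high‑reward clips and away from penalized ones.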

Results & Findings

  • Interaction accuracy improves by ≈15‑20 % on a suite of benchmark tasks (e.g., object manipulation, navigation) compared to the vanilla WorldPlay model.
  • Visual quality (measured by LPIPS and user preference studies) sees a 10‑12 % uplift, with fewer artifacts such as flickering or unrealistic textures.
  • Efficiency: The clip‑level rollout reduces the number of forward passes needed for a full‑episode evaluation by ≈3×, and the RL fine‑tuning adds only ≈0.5‑1 % extra training time per epoch.
  • Ablation studies confirm that each component (clip‑level rollout, dual rewards, negative‑aware updates) contributes meaningfully; removing any of them drops performance back toward the baseline.

Practical Implications

  • Simulation & robotics – developers can embed WorldCompass‑enhanced models into virtual environments for more faithful robot‑policy testing, where accurate reaction to commands is critical.
  • Interactive media – game studios and VFX pipelines can generate longer, controllable video sequences that stay on script while maintaining cinematic quality.
  • Generative AI assistants – chat‑driven video generation tools can produce longer, instruction‑following clips without the “drift” that often plagues current models.
  • Plug‑and‑play upgrade – because WorldCompass works as a post‑training wrapper, existing world‑model deployments can be upgraded with minimal engineering effort and without re‑collecting massive video datasets.

Limitations & Future Work

  • Domain specificity – the experiments focus on indoor, object‑centric scenes; performance on outdoor or highly dynamic domains (e.g., sports) remains untested.
  • Reward balance tuning – selecting the right weighting between interaction and visual rewards still requires manual hyper‑parameter sweeps.
  • Scalability to very long horizons – while clip‑level rollouts speed up training, generating truly hour‑long coherent videos may need hierarchical planning extensions.
  • Future directions suggested by the authors include integrating language‑conditioned rewards, exploring multi‑agent interaction scenarios, and releasing a lightweight API for broader community adoption.

Authors

  • Zehan Wang
  • Tengfei Wang
  • Haiyu Zhang
  • Xuhui Zuo
  • Junta Wu
  • Haoyuan Wang
  • Wenqiang Sun
  • Zhenwei Wang
  • Chenjie Cao
  • Hengshuang Zhao
  • Chunchao Guo
  • Zhou Zhao

Paper Information

  • arXiv ID: 2602.09022v1
  • Categories: cs.CV
  • Published: February 9, 2026