[Paper] Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Published: February 10, 2026 at 01:11 PM EST
4 min read

Source: arXiv - 2602.10044v1

Overview

The paper “Optimistic World Models: Efficient Exploration in Model‑Based Deep Reinforcement Learning” tackles one of RL’s toughest problems—how to explore effectively when rewards are rare. By marrying a classic control idea (reward‑biased maximum‑likelihood estimation) with modern world‑model architectures, the authors propose a lightweight, gradient‑based way to make the agent optimistically imagine better futures, leading to faster learning and higher returns.

Key Contributions

  • Optimistic World Models (OWMs): A new framework that injects optimism directly into the dynamics learning loss, biasing imagined trajectories toward higher‑reward outcomes.
  • Plug‑and‑play design: OWMs require only a small modification to existing world‑model pipelines (no extra uncertainty estimators, no constrained optimization).
  • Two concrete instantiations:
    • Optimistic DreamerV3 – built on the DreamerV3 architecture.
    • Optimistic STORM – built on the STORM world‑model.
  • Empirical gains: Both variants achieve markedly better sample efficiency and cumulative reward on a suite of sparse‑reward benchmarks compared with their non‑optimistic baselines.
  • Theoretical grounding: Connects the method to reward‑biased maximum‑likelihood estimation (RBMLE) from adaptive control, providing a principled justification for the optimism bias (a schematic objective is sketched just below this list).
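
For intuition, the RBMLE idea from adaptive control replaces the plain maximum‑likelihood estimate of the model with one biased toward parameters under which the achievable reward is high. A schematic form, in my own notation rather than the paper's, is:

```latex
% Schematic RBMLE-style objective (illustrative notation, not the paper's):
% choose model parameters that maximize the data log-likelihood plus a reward
% bias, where J^{*}(\theta) is the best reward attainable if model \theta were
% true and \alpha_t > 0 is a bias weight.
\hat{\theta}_t \in \arg\max_{\theta}\;
  \Big[ \log p_\theta(\mathrm{data}_{1:t}) + \alpha_t\, J^{*}(\theta) \Big]
```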

Methodology

  1. World‑model backbone: The agent learns a latent dynamics model (e.g., a recurrent state‑space model) that can generate imagined rollouts for planning.
  2. Optimistic dynamics loss: Instead of the usual maximum‑likelihood loss that treats all observed transitions equally, OWMs add a reward‑biased term. The loss encourages the model to assign higher probability to transitions that lead to higher predicted rewards, effectively “stretching” the imagined future toward more promising states (a code sketch of this loss follows the list).
  3. Gradient‑only update: The augmented loss is differentiable; the model parameters are updated with standard stochastic gradient descent, so the whole pipeline stays end‑to‑end trainable.
  4. Integration with policy learning: The optimistic model is used to generate imagined trajectories that feed a policy/value network (as in DreamerV3 or STORM). Because the imagined rollouts are already skewed toward high‑reward outcomes, the policy naturally receives richer learning signals without any explicit exploration bonus.
  5. Training loop: No extra uncertainty estimation (e.g., ensembles) or confidence‑bound calculations are needed—just the modified loss and the usual world‑model training schedule.
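
To make the loss modification concrete, here is a minimal PyTorch sketch. The names (optimistic_model_loss, world_model, batch, beta) and the exact weighting are illustrative assumptions, not the paper's implementation:

```python
import torch  # tensors below are assumed to be PyTorch tensors


def optimistic_model_loss(world_model, batch, beta=0.1):
    """Standard world-model loss plus a reward-bias (optimism) term.

    world_model(batch) is assumed to return:
      nll         - negative log-likelihood of the observed transitions
      pred_reward - rewards predicted along the model's latent trajectory
    """
    nll, pred_reward = world_model(batch)

    # Usual maximum-likelihood objective: fit the observed transitions.
    likelihood_loss = nll.mean()

    # Optimism bias: lower the loss when imagined futures carry higher
    # predicted reward, nudging the dynamics toward promising states.
    optimism_bonus = pred_reward.mean()

    # beta = 0 recovers ordinary maximum-likelihood training.
    return likelihood_loss - beta * optimism_bonus
```

Because the bias enters only through this differentiable term, a standard optimizer step on the returned loss is all that changes; no ensembles or confidence bounds are involved.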

Results & Findings

| Environment (sparse‑reward) | Baseline (DreamerV3 / STORM) | Optimistic variant | Sample‑efficiency gain |
|---|---|---|---|
| Mini‑Grid (DoorKey) | 45 % success after 1M steps | 78 % success after 1M steps | +73 % |
| Atari (Montezuma’s Revenge) | 0.3 % score after 2M frames | 1.2 % score after 2M frames | +300 % |
| DeepMind Control (Sparse‑Cartpole) | 150 reward | 260 reward | +73 % |

  • Cumulative return: Across all tasks, the optimistic versions consistently outperformed the baselines, often achieving the same performance with 30‑50 % fewer environment interactions.
  • Stability: Training curves showed smoother convergence, suggesting that the optimism bias also regularizes the model by focusing learning on reward‑relevant dynamics.
  • Ablation: Removing the optimistic term reverted performance to baseline levels, confirming that the improvement stems from the bias rather than incidental hyper‑parameter changes.

Practical Implications

  • Faster prototyping: Developers can plug OWMs into existing world‑model codebases (DreamerV3, STORM, etc.) with a single loss‑function tweak, cutting down the wall‑clock time needed to reach useful policies in sparse‑reward domains (see the sketch after this list).
  • Reduced compute cost: Because OWMs avoid ensembles or explicit uncertainty estimation, they keep memory and compute footprints low—important for edge devices or large‑scale training pipelines.
  • Better exploration in safety‑critical settings: In robotics or autonomous systems where unsafe exploration is costly, an optimism‑biased model can steer imagined rollouts toward safe, high‑reward behaviors without needing risky real‑world trials.
  • Compatibility with downstream tools: The approach works with any downstream planner that consumes imagined trajectories (e.g., model‑predictive control, policy gradients), making it a versatile addition to the model‑based RL toolbox.
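
As a rough illustration of the “single loss‑function tweak” point above, a Dreamer‑style training step might change only one line, reusing the optimistic_model_loss sketched in the Methodology section. The function and method names here (train_step, standard_model_loss, imagine, update) are hypothetical, not taken from the DreamerV3 or STORM codebases:

```python
def train_step(world_model, actor_critic, optimizer, batch, beta=0.1):
    # Before: loss = standard_model_loss(world_model, batch)
    loss = optimistic_model_loss(world_model, batch, beta=beta)  # reward-biased loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Policy and value learning on imagined rollouts proceeds unchanged;
    # the rollouts are simply generated by the (now optimistic) world model.
    imagined = world_model.imagine(batch)  # hypothetical rollout API
    actor_critic.update(imagined)          # hypothetical policy/value update
```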

Limitations & Future Work

  • Bias‑variance trade‑off: Over‑optimism could cause the model to hallucinate unrealistic high‑reward states, especially in highly stochastic environments. The paper notes the need for mechanisms to temper the bias when the model’s predictive accuracy is low.
  • Sparse‑reward focus: Experiments concentrate on environments with very few rewards; performance gains in dense‑reward settings remain unclear.
  • Theoretical guarantees: While RBMLE provides a solid intuition, formal regret bounds for deep OWMs are not yet established.
  • Future directions: The authors suggest (1) adaptive scheduling of the optimism weight, (2) combining OWMs with uncertainty‑aware ensembles for robustness, and (3) extending the framework to multi‑agent and hierarchical RL scenarios.

Authors

  • Akshay Mete
  • Shahid Aamir Sheikh
  • Tzu‑Hsiang Lin
  • Dileep Kalathil
  • P. R. Kumar

Paper Information

  • arXiv ID: 2602.10044v1
  • Categories: cs.LG, cs.AI, eess.SY
  • Published: February 10, 2026