[Paper] Optimistic World Models: Efficient Exploration in Model-Based Deep Reinforcement Learning

Published: February 10, 2026 at 01:11 PM EST
4 min read

Source: arXiv - 2602.10044v1

Overview

The paper “Optimistic World Models: Efficient Exploration in Model‑Based Deep Reinforcement Learning” tackles one of RL’s toughest problems—how to explore effectively when rewards are rare. By marrying a classic control idea (reward‑biased maximum‑likelihood estimation) with modern world‑model architectures, the authors propose a lightweight, gradient‑based way to make the agent optimistically imagine better futures, leading to faster learning and higher returns.

Key Contributions

  • Optimistic World Models (OWMs): A new framework that injects optimism directly into the dynamics learning loss, biasing imagined trajectories toward higher‑reward outcomes.
  • Plug‑and‑play design: OWMs require only a small modification to existing world‑model pipelines (no extra uncertainty estimators, no constrained optimization).
  • Two concrete instantiations:
    • Optimistic DreamerV3 – built on the DreamerV3 architecture.
    • Optimistic STORM – built on the STORM world‑model.
  • Empirical gains: Both variants achieve markedly better sample efficiency and cumulative reward on a suite of sparse‑reward benchmarks compared with their non‑optimistic baselines.
  • Theoretical grounding: Connects the method to reward‑biased maximum‑likelihood estimation (RBMLE) from adaptive control, providing a principled justification for the optimism bias (a schematic objective is sketched just below this list).
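
For intuition, the RBMLE idea from adaptive control replaces the plain maximum‑likelihood estimate of the model with one biased toward parameters under which the achievable reward is high. A schematic form, in my own notation rather than the paper's, is:

```latex
% Schematic RBMLE-style objective (illustrative notation, not the paper's):
% choose model parameters that maximize the data log-likelihood plus a reward
% bias, where J^{*}(\theta) is the best reward attainable if model \theta were
% true and \alpha_t > 0 is a bias weight.
\hat{\theta}_t \in \arg\max_{\theta}\;
  \Big[ \log p_\theta(\mathrm{data}_{1:t}) + \alpha_t\, J^{*}(\theta) \Big]
```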

Methodology

  1. World‑model backbone: The agent learns a latent dynamics model (e.g., a recurrent state‑space model) that can generate imagined rollouts for planning.
  2. Optimistic dynamics loss: Instead of the usual maximum‑likelihood loss that treats all observed transitions equally, OWMs add a reward‑biased term. The loss encourages the model to assign higher probability to transitions that lead to higher predicted rewards, effectively “stretching” the imagined future toward more promising states (a code sketch of this loss follows the list).
  3. Gradient‑only update: The augmented loss is differentiable; the model parameters are updated with standard stochastic gradient descent, so the whole pipeline stays end‑to‑end trainable.
  4. Integration with policy learning: The optimistic model is used to generate imagined trajectories that feed a policy/value network (as in DreamerV3 or STORM). Because the imagined rollouts are already skewed toward high‑reward outcomes, the policy naturally receives richer learning signals without any explicit exploration bonus.
  5. Training loop: No extra uncertainty estimation (e.g., ensembles) or confidence‑bound calculations are needed—just the modified loss and the usual world‑model training schedule.
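
To make the loss modification concrete, here is a minimal PyTorch sketch. The names (optimistic_model_loss, world_model, batch, beta) and the exact weighting are illustrative assumptions, not the paper's implementation:

```python
import torch  # tensors below are assumed to be PyTorch tensors


def optimistic_model_loss(world_model, batch, beta=0.1):
    """Standard world-model loss plus a reward-bias (optimism) term.

    world_model(batch) is assumed to return:
      nll         - negative log-likelihood of the observed transitions
      pred_reward - rewards predicted along the model's latent trajectory
    """
    nll, pred_reward = world_model(batch)

    # Usual maximum-likelihood objective: fit the observed transitions.
    likelihood_loss = nll.mean()

    # Optimism bias: lower the loss when imagined futures carry higher
    # predicted reward, nudging the dynamics toward promising states.
    optimism_bonus = pred_reward.mean()

    # beta = 0 recovers ordinary maximum-likelihood training.
    return likelihood_loss - beta * optimism_bonus
```

Because the bias enters only through this differentiable term, a standard optimizer step on the returned loss is all that changes; no ensembles or confidence bounds are involved.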

Results & Findings

| Environment (sparse‑reward) | Baseline (DreamerV3 / STORM) | Optimistic variant | Sample‑efficiency gain |
|---|---|---|---|
| Mini‑Grid (DoorKey) | 45 % success after 1M steps | 78 % success after 1M steps | +73 % |
| Atari (Montezuma’s Revenge) | 0.3 % score after 2M frames | 1.2 % score after 2M frames | +300 % |
| DeepMind Control (Sparse‑Cartpole) | 150 reward | 260 reward | +73 % |

  • Cumulative return: Across all tasks, the optimistic versions consistently outperformed the baselines, often achieving the same performance with 30‑50 % fewer environment interactions.
  • Stability: Training curves showed smoother convergence, suggesting that the optimism bias also regularizes the model by focusing learning on reward‑relevant dynamics.
  • Ablation: Removing the optimistic term reverted performance to baseline levels, confirming that the improvement stems from the bias rather than incidental hyper‑parameter changes.

Practical Implications

  • Faster prototyping: Developers can plug OWMs into existing world‑model codebases (DreamerV3, STORM, etc.) with a single loss‑function tweak, cutting down the wall‑clock time needed to reach useful policies in sparse‑reward domains (see the sketch after this list).
  • Reduced compute cost: Because OWMs avoid ensembles or explicit uncertainty estimation, they keep memory and compute footprints low—important for edge devices or large‑scale training pipelines.
  • Better exploration in safety‑critical settings: In robotics or autonomous systems where unsafe exploration is costly, an optimism‑biased model can steer imagined rollouts toward safe, high‑reward behaviors without needing risky real‑world trials.
  • Compatibility with downstream tools: The approach works with any downstream planner that consumes imagined trajectories (e.g., model‑predictive control, policy gradients), making it a versatile addition to the model‑based RL toolbox.
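
As a rough illustration of the “single loss‑function tweak” point above, a Dreamer‑style training step might change only one line, reusing the optimistic_model_loss sketched in the Methodology section. The function and method names here (train_step, standard_model_loss, imagine, update) are hypothetical, not taken from the DreamerV3 or STORM codebases:

```python
def train_step(world_model, actor_critic, optimizer, batch, beta=0.1):
    # Before: loss = standard_model_loss(world_model, batch)
    loss = optimistic_model_loss(world_model, batch, beta=beta)  # reward-biased loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Policy and value learning on imagined rollouts proceeds unchanged;
    # the rollouts are simply generated by the (now optimistic) world model.
    imagined = world_model.imagine(batch)  # hypothetical rollout API
    actor_critic.update(imagined)          # hypothetical policy/value update
```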

Limitations & Future Work

  • Bias‑variance trade‑off: Over‑optimism could cause the model to hallucinate unrealistic high‑reward states, especially in highly stochastic environments. The paper notes the need for mechanisms to temper the bias when the model’s predictive accuracy is low.
  • Sparse‑reward focus: Experiments concentrate on environments with very few rewards; performance gains in dense‑reward settings remain unclear.
  • Theoretical guarantees: While RBMLE provides a solid intuition, formal regret bounds for deep OWMs are not yet established.
  • Future directions: The authors suggest (1) adaptive scheduling of the optimism weight, (2) combining OWMs with uncertainty‑aware ensembles for robustness, and (3) extending the framework to multi‑agent and hierarchical RL scenarios.

Authors

  • Akshay Mete
  • Shahid Aamir Sheikh
  • Tzu‑Hsiang Lin
  • Dileep Kalathil
  • P. R. Kumar

Paper Information

  • arXiv ID: 2602.10044v1
  • Categories: cs.LG, cs.AI, eess.SY
  • Published: February 10, 2026