[Paper] DynaWeb: Model-Based Reinforcement Learning of Web Agents

Published: January 29, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.22149v1

Overview

The paper presents DynaWeb, a model‑based reinforcement learning (MBRL) framework that lets autonomous web agents learn by “imagining” interactions with a simulated web environment instead of constantly hitting the live internet. By training a world model to predict realistic web page states from agent actions, DynaWeb generates massive amounts of synthetic experience, dramatically cutting the cost, latency, and safety risks of traditional online RL for web automation.

Key Contributions

  • World‑model for the web: Introduces a neural “web simulator” that predicts naturalistic page representations conditioned on agent actions, turning the open‑world web into a controllable training sandbox.
  • Dream‑based policy learning: Leverages the simulator to generate unlimited rollout trajectories (“dreams”), enabling efficient on‑policy RL without expensive real‑world queries (see the sketch after this list).
  • Hybrid data mixing: Randomly interleaves real expert trajectories from existing datasets with simulated rollouts, improving stability and sample efficiency.
  • Empirical validation: Shows consistent performance gains on two demanding benchmarks—WebArena and WebVoyager—over strong open‑source baselines.
  • Scalable training pipeline: Demonstrates that model‑based RL can scale to the complexity of modern web tasks, opening a path toward large‑scale, cost‑effective web agent development.
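
To make the dream‑rollout idea concrete, here is a minimal Python sketch of a policy stepping through pages predicted by a learned web simulator instead of the live web. All names (`WebWorldModel`, `predict_next_page`, `WebPolicy`) are illustrative assumptions, and the stub methods return placeholders where trained models would produce real predictions; this is not the paper's actual API.

```python
from dataclasses import dataclass
import random

@dataclass
class Transition:
    page: str       # page representation (e.g., an ID for a DOM/text embedding)
    action: str     # agent action, e.g., "click #submit"
    next_page: str
    reward: float

class WebWorldModel:
    """Stand-in for a learned simulator that predicts the next page from (page, action)."""
    def predict_next_page(self, page: str, action: str) -> str:
        # A trained model would return a predicted DOM/text representation here.
        return f"{page} -> {action}"

class WebPolicy:
    """Stand-in for an LLM-backed policy mapping (task, page) to an action."""
    def act(self, task: str, page: str) -> str:
        return random.choice(["click", "type", "scroll"])

def dream_rollout(policy, world_model, task, start_page, horizon=5):
    """Generate one imagined trajectory without touching the live web."""
    page, trajectory = start_page, []
    for _ in range(horizon):
        action = policy.act(task, page)
        next_page = world_model.predict_next_page(page, action)
        trajectory.append(Transition(page, action, next_page, reward=0.0))
        page = next_page  # feed the predicted page back into the policy
    return trajectory

if __name__ == "__main__":
    dreams = [dream_rollout(WebPolicy(), WebWorldModel(), "book a flight", "home")
              for _ in range(3)]
    print(f"{len(dreams)} imagined trajectories generated without a single web request")
```

The key point is that every step of the loop queries the simulator, so rollout cost is decoupled from web latency, rate limits, and API fees.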

Methodology

  1. Data Collection – Gather a corpus of expert demonstrations (action‑page pairs) from existing web‑automation datasets.
  2. World‑Model Training – Train a transformer‑based encoder‑decoder that takes the current page representation and an agent action, and predicts the next page’s DOM/text embedding. The model is optimized with a combination of reconstruction loss (to match real pages) and contrastive objectives (to keep embeddings discriminative); a simplified sketch of this objective, together with the hybrid mixing from step 5, follows this list.
  3. Policy Architecture – Use a standard LLM‑backed policy (e.g., a fine‑tuned GPT‑Neo) that maps the current page embedding and task description to the next action (click, type, scroll, etc.).
  4. Dream Rollouts – During RL, the policy interacts with the world model instead of the live web. Each step feeds the predicted page back into the policy, producing long simulated trajectories at negligible cost.
  5. Hybrid Replay Buffer – Maintain a replay buffer that stores both real expert trajectories and simulated ones. At each training iteration, a random mini‑batch mixes the two sources, ensuring the policy never drifts too far from reality.
  6. Online RL Loop – Apply a standard on‑policy algorithm (e.g., PPO) on the mixed buffer, updating the policy parameters while periodically refreshing the world model with newly collected real interactions to prevent model drift.
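
As a rough illustration of steps 2 and 5, the sketch below trains a deliberately simplified world model with a reconstruction‑plus‑contrastive objective and shows how a mini‑batch might mix real and simulated data. It substitutes a small MLP for the paper's transformer encoder‑decoder, and the embedding size, loss weighting, and function names are all assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 128  # assumed size of page/action embeddings

class WorldModel(nn.Module):
    """Simplified stand-in for the encoder-decoder: predicts the next page embedding."""
    def __init__(self, emb_dim: int = EMB):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * emb_dim, 256), nn.ReLU(),
                                 nn.Linear(256, emb_dim))

    def forward(self, page_emb, action_emb):
        return self.net(torch.cat([page_emb, action_emb], dim=-1))

def world_model_loss(pred, target, temperature=0.1):
    """Reconstruction (MSE) plus an InfoNCE-style contrastive term over the batch."""
    recon = F.mse_loss(pred, target)
    logits = F.normalize(pred, dim=-1) @ F.normalize(target, dim=-1).T / temperature
    labels = torch.arange(pred.size(0))  # matching (pred, target) pairs sit on the diagonal
    return recon + F.cross_entropy(logits, labels)

def mixed_batch(real, simulated, batch_size=32, real_ratio=0.5):
    """Hybrid replay: randomly interleave real expert data with simulated rollouts."""
    n_real = int(batch_size * real_ratio)
    idx_real = torch.randint(len(real), (n_real,))
    idx_sim = torch.randint(len(simulated), (batch_size - n_real,))
    return torch.cat([real[idx_real], simulated[idx_sim]], dim=0)

if __name__ == "__main__":
    model = WorldModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Random embeddings stand in for (page, action, next_page) expert demonstrations.
    pages, actions, next_pages = (torch.randn(256, EMB) for _ in range(3))
    for _ in range(50):
        loss = world_model_loss(model(pages, actions), next_pages)
        opt.zero_grad(); loss.backward(); opt.step()
    # Step 5: draw a mini-batch that mixes real and simulated experience.
    real, simulated = torch.randn(500, EMB), torch.randn(2000, EMB)
    print(mixed_batch(real, simulated).shape, f"final world-model loss {loss.item():.3f}")
```

In the full pipeline, the policy update (step 6) would then run PPO or a similar on‑policy algorithm over such mixed batches, while the world model is periodically refreshed with newly collected real interactions.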

Results & Findings

| Benchmark | Baseline (open‑source) | DynaWeb (ours) | Relative ↑ |
| --- | --- | --- | --- |
| WebArena (task success %) | 42.3% | 58.7% | +16.4 pts |
| WebVoyager (task success %) | 35.1% | 51.2% | +16.1 pts |
| Sample efficiency (episodes to 50% success) | ~1200 | ~420 | ~65% reduction |
| Training cost (GPU‑hrs) | 96 | 38 | ~60% saving |

Interpretation: By augmenting real experience with high‑quality simulated rollouts, DynaWeb reaches higher success rates while needing roughly one‑third the number of live web interactions. The gains are consistent across both benchmarks, confirming that the world model captures enough of the web’s dynamics to be useful for policy learning.

Practical Implications

  • Cost‑effective agent development – Companies can train sophisticated web‑automation bots without incurring massive API fees or bandwidth usage.
  • Safety & compliance – Simulated rollouts avoid accidental data leakage, spamming, or violating terms of service during the learning phase.
  • Rapid prototyping – Developers can iterate on new task specifications (e.g., new form‑filling flows) by simply re‑training the policy on the existing world model, cutting turnaround time from weeks to days.
  • Scalable RL pipelines – The DynaWeb architecture fits naturally into existing RL‑as‑a‑service stacks (e.g., Ray RLlib), enabling cloud‑native training of thousands of parallel agents.
  • Foundation for “agentic” LLMs – By providing a cheap, high‑fidelity sandbox, DynaWeb paves the way for future LLM‑driven assistants that can self‑improve on web tasks without human‑in‑the‑loop supervision.

Limitations & Future Work

  • World‑model fidelity – The simulator still struggles with highly dynamic content (e.g., real‑time stock tickers, CAPTCHA challenges) where visual cues dominate.
  • Domain shift – Policies trained heavily on simulated data may under‑perform on completely unseen websites that differ structurally from the training corpus.
  • Scalability of the model – Training the world model on the full, ever‑changing internet would require continual updates; the current approach relies on a static snapshot of web pages.
  • Future directions – The authors suggest integrating multimodal perception (rendered screenshots), continual world‑model adaptation, and hierarchical policies that can plan over longer horizons across multiple sites.

Bottom line: DynaWeb shows that “training web agents by imagination” isn’t just a research curiosity—it’s a practical, scalable strategy that can dramatically lower the barrier for developers building autonomous, LLM‑powered web assistants.

Authors

  • Hang Ding
  • Peidong Liu
  • Junqiao Wang
  • Ziwei Ji
  • Meng Cao
  • Rongzhao Zhang
  • Lynn Ai
  • Eric Yang
  • Tianyu Shi
  • Lei Yu

Paper Information

  • arXiv ID: 2601.22149v1
  • Categories: cs.CL, cs.AI
  • Published: January 29, 2026