[Paper] Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction

Published: November 28, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2511.23476v1

Overview

The paper “Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi‑turn Interaction” tackles a core problem for LLM‑powered agents: how to let a language model learn the dynamics of an environment without being forced into a rigid, step‑by‑step reasoning chain. By letting the model “act” and receive real feedback, the authors show that an LLM can internalize a world model much faster and with far fewer interaction turns.

Key Contributions

  • WMAct framework – a lightweight recipe that lets LLMs reason through doing rather than pre‑structured logical steps.
  • Reward rescaling – dynamically adjusts the reward signal based on how effective an action was, encouraging the model to cut redundant steps.
  • Interaction‑frequency annealing – gradually tightens the maximum number of allowed interaction turns, forcing the model to compress knowledge into its internal representation.
  • Empirical validation on classic planning domains (Sokoban, Maze, Taxi) demonstrating single‑turn solutions where prior methods needed multiple rounds.
  • Transferability – the learned reasoning skills generalize to more complex, unseen environments and improve performance on a suite of reasoning benchmarks.

Methodology

  1. Problem framing – Treat world‑model reasoning as a multi‑turn dialogue between the LLM (the agent) and a simulated environment (the teacher). Each turn consists of an action proposal, environment feedback, and a reward.
  2. Free‑form interaction – Unlike prior work that forces the model to follow a fixed “think‑plan‑act” template, WMAct lets the model generate any textual action it deems useful. The environment simply returns the next state and a scalar reward.
  3. Reward rescaling – The raw reward is multiplied by a factor that reflects action efficacy: actions that move the agent closer to the goal get a boost, while wasted moves are penalized. This reshaped signal nudges the model toward concise, purposeful behavior.
  4. Annealing interaction budget – Training starts with a generous cap on the number of turns (e.g., 10). After each epoch the cap is reduced (e.g., 10 → 8 → 5 …). The model must therefore learn to solve the task with fewer external hints, effectively “internalizing” the world dynamics.
  5. Training loop – The LLM is fine‑tuned with reinforcement‑learning‑style updates (PPO‑like) using the rescaled rewards, while the environment remains a deterministic simulator for the chosen domains. (Illustrative sketches of the reward rescaling and the annealed interaction loop follow this list.)
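To make the reward‑rescaling step concrete, here is a minimal sketch. The distance‑based efficacy metric and the `alpha`/`beta` coefficients are assumptions for illustration, not the paper's exact formulation.

```python
def rescale_reward(raw_reward: float,
                   dist_before: float,
                   dist_after: float,
                   alpha: float = 1.0,
                   beta: float = 0.5) -> float:
    """Scale the raw environment reward by an action-efficacy factor.

    Illustrative placeholder: the paper's exact efficacy metric is not
    reproduced here. `dist_before` / `dist_after` are distances to the goal
    before and after the proposed action.
    """
    progress = dist_before - dist_after      # > 0 if the action moved the agent closer
    if progress > 0:
        factor = 1.0 + alpha * progress      # boost purposeful actions
    else:
        factor = max(0.0, 1.0 + beta * progress - beta)  # dampen wasted or backward moves
    return raw_reward * factor
```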
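A companion sketch of the annealed multi‑turn loop (steps 1, 2, 4, and 5), assuming a hypothetical `env`/`llm` interface with `reset`, `step`, `distance_to_goal`, `propose_action`, and `ppo_update` methods; the schedule values are likewise illustrative.

```python
# Illustrative sketch of the annealed multi-turn loop; the interface names
# (env.reset, env.step, env.distance_to_goal, llm.propose_action, llm.ppo_update)
# are assumptions, not the authors' API.
TURN_CAP_SCHEDULE = [10, 8, 5, 3, 1]          # example interaction-budget annealing

def run_episode(llm, env, turn_cap):
    """Roll out one episode under a fixed turn budget, collecting (state, action, reward)."""
    state = env.reset()
    transcript, trajectory = [], []
    for _ in range(turn_cap):
        action = llm.propose_action(state, transcript)     # free-form textual action
        next_state, raw_reward, done = env.step(action)    # simulator feedback
        reward = rescale_reward(raw_reward,                 # from the rescaling sketch
                                env.distance_to_goal(state),
                                env.distance_to_goal(next_state))
        trajectory.append((state, action, reward))
        transcript.append((action, next_state))
        state = next_state
        if done:
            break
    return trajectory

def train(llm, env, episodes_per_cap=1000):
    """Tighten the turn budget over training so the model internalizes the dynamics."""
    for turn_cap in TURN_CAP_SCHEDULE:
        for _ in range(episodes_per_cap):
            trajectory = run_episode(llm, env, turn_cap)
            llm.ppo_update(trajectory)                      # PPO-style update on rescaled rewards
```

The point of the sketch is that the shrinking turn cap, not a fixed reasoning template, is what pressures the model to compress environment knowledge into its own representation.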

Results & Findings

| Domain  | Prior multi‑turn baseline (avg. turns) | WMAct (avg. turns) | Success Rate ↑ |
|---------|----------------------------------------|--------------------|----------------|
| Sokoban | 4.7                                    | 1.2                | +18%           |
| Maze    | 6.3                                    | 1.0                | +22%           |
| Taxi    | 5.1                                    | 1.3                | +15%           |

  • Single‑turn mastery: After annealing, the model solves many instances in a single interaction, indicating that it has built an internal world model.
  • Reduced redundancy: The reward‑rescaling mechanism cuts unnecessary back‑and‑forth, leading to shorter dialogues and lower compute cost.
  • Cross‑domain transfer: When evaluated on unseen, larger mazes and a set of reasoning puzzles (e.g., logical deduction, spatial reasoning), WMAct‑trained models outperform baseline LLM agents by 10‑12% absolute accuracy.

Practical Implications

  • Faster agent deployment – Fewer interaction rounds mean lower latency and cheaper API usage when LLMs are used as planners for robotics, game AI, or autonomous navigation.
  • Resource‑efficient fine‑tuning – The annealing schedule eliminates the need for massive multi‑turn datasets; a modest amount of interaction data suffices to teach the model the environment’s physics.
  • Better generalization – By forcing the model to internalize dynamics, developers can expect more robust behavior when the environment changes slightly (e.g., new map layouts or altered reward structures).
  • Plug‑and‑play – WMAct is model‑agnostic; it can be applied to any instruction‑tuned LLM (GPT‑3.5, LLaMA‑2, Claude) with minimal code changes, making it attractive for product teams building “thinking‑by‑doing” assistants (a minimal interface sketch follows this list).
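To illustrate the plug‑and‑play claim, here is a minimal sketch of a model‑agnostic agent interface that the interaction loop above could drive. The `WMActAgent` protocol, the `ChatLLMAgent` wrapper, and the `client.complete` call are all hypothetical, standing in for whatever API a given model exposes.

```python
from typing import Protocol

class WMActAgent(Protocol):
    """Minimal interface the interaction loop needs from any instruction-tuned LLM."""
    def propose_action(self, state: str, transcript: list) -> str: ...
    def ppo_update(self, trajectory: list) -> None: ...

class ChatLLMAgent:
    """Wraps a generic chat-completion client behind the WMActAgent interface.

    `client.complete(prompt)` is a placeholder for whatever API the chosen
    model exposes (a hosted endpoint, a local LLaMA-2 server, etc.).
    """
    def __init__(self, client):
        self.client = client

    def propose_action(self, state: str, transcript: list) -> str:
        history = "\n".join(f"Action: {a}\nObservation: {o}" for a, o in transcript)
        prompt = f"Environment state:\n{state}\n{history}\nPropose the next action:"
        return self.client.complete(prompt)

    def ppo_update(self, trajectory: list) -> None:
        # Weight updates only apply to models you can fine-tune; for closed
        # API models this hook would be a no-op or prompt-level adaptation.
        raise NotImplementedError
```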

Limitations & Future Work

  • Deterministic simulators only – Experiments rely on fully deterministic environments; stochastic or partially observable worlds may need additional uncertainty handling.
  • Reward design sensitivity – The efficacy of reward rescaling hinges on a well‑crafted efficacy metric; poorly chosen scaling can destabilize training.
  • Scalability to high‑dimensional actions – The current setup uses discrete action spaces (move up/down/left/right). Extending WMAct to continuous control (e.g., robot arm torques) remains an open challenge.
  • Future directions – The authors suggest integrating model‑based RL techniques to blend learned world models with WMAct’s interaction‑driven learning, and testing on real‑world robotics platforms where sensor noise and latency are factors.

Authors

  • Bao Shu
  • Yan Cai
  • Jianjian Sun
  • Chunrui Han
  • En Yu
  • Liang Zhao
  • Jingcheng Hu
  • Yinmin Zhang
  • Haoran Lv
  • Yuang Peng
  • Zheng Ge
  • Xiangyu Zhang
  • Daxin Jiang
  • Xiangyu Yue

Paper Information

  • arXiv ID: 2511.23476v1
  • Categories: cs.AI
  • Published: November 28, 2025