[Paper] Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction
Source: arXiv - 2511.23476v1
Overview
The paper “Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi‑turn Interaction” tackles a core problem for LLM‑powered agents: how to let a language model learn the dynamics of an environment without being forced into a rigid, step‑by‑step reasoning chain. By letting the model “act” and receive real feedback, the authors show that an LLM can internalize a world model much faster and with far fewer interaction turns.
Key Contributions
- WMAct framework – a lightweight recipe that lets LLMs learn to reason by doing rather than by following pre‑structured logical steps.
- Reward rescaling – dynamically adjusts the reward signal based on how effective an action was, encouraging the model to cut redundant steps.
- Interaction‑frequency annealing – gradually tightens the maximum number of allowed interaction turns, forcing the model to compress knowledge into its internal representation.
- Empirical validation on classic planning domains (Sokoban, Maze, Taxi) demonstrating single‑turn solutions where prior methods needed multiple rounds.
- Transferability – the learned reasoning skills generalize to more complex, unseen environments and improve performance on a suite of reasoning benchmarks.
Methodology
- Problem framing – Treat world‑model reasoning as a multi‑turn dialogue between the LLM (the agent) and a simulated environment (the teacher). Each turn consists of an action proposal, environment feedback, and a reward.
- Free‑form interaction – Unlike prior work that forces the model to follow a fixed "think‑plan‑act" template, WMAct lets the model generate any textual action it deems useful. The environment simply returns the next state and a scalar reward (see the interaction‑loop sketch after this list).
- Reward rescaling – The raw reward is multiplied by a factor that reflects action efficacy: actions that move the agent closer to the goal get a boost, while wasted moves are penalized. This reshaped signal nudges the model toward concise, purposeful behavior (one possible rescaling rule is sketched below).
- Annealing interaction budget – Training starts with a generous cap on the number of turns (e.g., 10). After each epoch the cap is reduced (e.g., 10 → 8 → 5 …). The model must therefore learn to solve the task with fewer external hints, effectively “internalizing” the world dynamics.
- Training loop – The LLM is fine‑tuned with reinforcement‑learning‑style (PPO‑like) updates on the rescaled rewards, while the environment remains a deterministic simulator for the chosen domains (the annealed budget and update loop are sketched after this list).
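To make the interaction protocol concrete, here is a minimal Python sketch of the multi‑turn loop described above. The `env` and `llm` objects, their method names (`reset`, `step`, `propose_action`), and the transcript format are illustrative assumptions, not the authors' actual interface.

```python
def run_episode(llm, env, max_turns):
    """Roll out one episode of free-form interaction.

    Assumed (hypothetical) interface:
      env.reset() -> state                              # initial textual state description
      env.step(action) -> (next_state, reward, done)    # simulator feedback + scalar reward
      llm.propose_action(state, transcript) -> action   # free-form textual action
    """
    state = env.reset()
    transcript = []  # list of (state, action, reward) turns fed back to the LLM
    for _ in range(max_turns):
        action = llm.propose_action(state, transcript)  # free-form action proposal
        next_state, reward, done = env.step(action)     # environment feedback
        transcript.append((state, action, reward))
        state = next_state
        if done:
            break
    return transcript
```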
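One plausible way to implement the efficacy‑based rescaling is to compare the agent's distance to the goal before and after an action; the paper's exact efficacy metric and scaling factors may differ, so treat the numbers below as placeholders.

```python
def rescale_reward(raw_reward, dist_before, dist_after, boost=1.5, damp=0.5):
    """Scale the raw reward by action efficacy (illustrative factors).

    Assumes raw_reward is non-negative (e.g., a progress/success reward):
    actions that reduce the distance to the goal are amplified, while
    redundant or backward moves have their contribution shrunk.
    """
    if dist_after < dist_before:   # the action made measurable progress
        return raw_reward * boost
    return raw_reward * damp       # wasted move: damp its contribution
```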
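The annealed turn budget and the outer training loop could then look like the sketch below, reusing `run_episode` from the earlier snippet. `llm.ppo_update` stands in for whatever PPO‑style fine‑tuning step is used and is purely hypothetical, as is the specific cap schedule.

```python
def turn_budget(epoch, caps=(10, 8, 5, 3, 1)):
    """Interaction cap for a given training epoch; clamps to the final value."""
    return caps[min(epoch, len(caps) - 1)]

def train(llm, env, epochs=5, episodes_per_epoch=256):
    """Outer loop: tighten the interaction budget each epoch, collect
    rollouts under that budget, and fine-tune with an RL-style update."""
    for epoch in range(epochs):
        max_turns = turn_budget(epoch)          # annealed cap: 10 -> 8 -> 5 -> ...
        rollouts = [run_episode(llm, env, max_turns)
                    for _ in range(episodes_per_epoch)]
        llm.ppo_update(rollouts)                # hypothetical PPO-like update on rescaled rewards
```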
Results & Findings
| Domain | Prior multi‑turn baseline (avg. turns) | WMAct (avg. turns) | Success‑rate improvement |
|---|---|---|---|
| Sokoban | 4.7 | 1.2 | +18% |
| Maze | 6.3 | 1.0 | +22% |
| Taxi | 5.1 | 1.3 | +15% |
- Single‑turn mastery: After annealing, the model solves many instances in a single interaction, indicating that it has built an internal world model.
- Reduced redundancy: The reward‑rescaling mechanism cuts unnecessary back‑and‑forth, leading to shorter dialogues and lower compute cost.
- Cross‑domain transfer: When evaluated on unseen, larger mazes and a set of reasoning puzzles (e.g., logical deduction, spatial reasoning), WMAct‑trained models outperform baseline LLM agents by 10–12 accuracy points (absolute).
Practical Implications
- Faster agent deployment – Fewer interaction rounds mean lower latency and cheaper API usage when LLMs are used as planners for robotics, game AI, or autonomous navigation.
- Resource‑efficient fine‑tuning – The annealing schedule eliminates the need for massive multi‑turn datasets; a modest amount of interaction data suffices to teach the model the environment’s physics.
- Better generalization – By forcing the model to internalize dynamics, developers can expect more robust behavior when the environment changes slightly (e.g., new map layouts or altered reward structures).
- Plug‑and‑play – WMAct is model‑agnostic; it can be applied to any instruction‑tuned LLM (GPT‑3.5, LLaMA‑2, Claude) with minimal code changes, making it attractive for product teams building “thinking‑by‑doing” assistants.
Limitations & Future Work
- Deterministic simulators only – Experiments rely on fully deterministic environments; stochastic or partially observable worlds may need additional uncertainty handling.
- Reward design sensitivity – The efficacy of reward rescaling hinges on a well‑crafted efficacy metric; poorly chosen scaling can destabilize training.
- Scalability to high‑dimensional actions – The current setup uses discrete action spaces (move up/down/left/right). Extending WMAct to continuous control (e.g., robot arm torques) remains an open challenge.
- Future directions – The authors suggest integrating model‑based RL techniques to blend learned world models with WMAct’s interaction‑driven learning, and testing on real‑world robotics platforms where sensor noise and latency are factors.
Authors
- Bao Shu
- Yan Cai
- Jianjian Sun
- Chunrui Han
- En Yu
- Liang Zhao
- Jingcheng Hu
- Yinmin Zhang
- Haoran Lv
- Yuang Peng
- Zheng Ge
- Xiangyu Zhang
- Daxin Jiang
- Xiangyu Yue
Paper Information
- arXiv ID: 2511.23476v1
- Categories: cs.AI
- Published: November 28, 2025