[Paper] Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi-turn Interaction
Source: arXiv - 2511.23476v1
Overview
The paper “Thinking by Doing: Building Efficient World Model Reasoning in LLMs via Multi‑turn Interaction” tackles a core problem for LLM‑powered agents: how to let a language model learn the dynamics of an environment without being forced into a rigid, step‑by‑step reasoning chain. By letting the model “act” and receive real feedback, the authors show that an LLM can internalize a world model much faster and with far fewer interaction turns.
Key Contributions
- WMAct framework – a lightweight recipe that lets LLMs learn to reason by doing rather than by following pre‑structured logical steps.
- Reward rescaling – dynamically adjusts the reward signal based on how effective an action was, encouraging the model to cut redundant steps.
- Interaction‑frequency annealing – gradually tightens the maximum number of allowed interaction turns, forcing the model to compress knowledge into its internal representation.
- Empirical validation on classic planning domains (Sokoban, Maze, Taxi) demonstrating single‑turn solutions where prior methods needed multiple rounds.
- Transferability – the learned reasoning skills generalize to more complex, unseen environments and improve performance on a suite of reasoning benchmarks.
Methodology
- Problem framing – Treat world‑model reasoning as a multi‑turn dialogue between the LLM (the agent) and a simulated environment (the teacher). Each turn consists of an action proposal, environment feedback, and a reward.
- Free‑form interaction – Unlike prior work that forces the model to follow a fixed "think‑plan‑act" template, WMAct lets the model generate any textual action it deems useful. The environment simply returns the next state and a scalar reward (see the interaction‑loop sketch after this list).
- Reward rescaling – The raw reward is multiplied by a factor that reflects action efficacy: actions that move the agent closer to the goal get a boost, while wasted moves are penalized. This reshaped signal nudges the model toward concise, purposeful behavior (one possible rescaling rule is sketched below).
- Annealing interaction budget – Training starts with a generous cap on the number of turns (e.g., 10). After each epoch the cap is reduced (e.g., 10 → 8 → 5 …). The model must therefore learn to solve the task with fewer external hints, effectively “internalizing” the world dynamics.
- Training loop – The LLM is fine‑tuned with reinforcement‑learning‑style (PPO‑like) updates on the rescaled rewards, while the environment remains a deterministic simulator for the chosen domains (the annealed budget and update loop are sketched after this list).
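To make the interaction protocol concrete, here is a minimal Python sketch of the multi‑turn loop described above. The `env` and `llm` objects, their method names (`reset`, `step`, `propose_action`), and the transcript format are illustrative assumptions, not the authors' actual interface.

```python
def run_episode(llm, env, max_turns):
    """Roll out one episode of free-form interaction.

    Assumed (hypothetical) interface:
      env.reset() -> state                              # initial textual state description
      env.step(action) -> (next_state, reward, done)    # simulator feedback + scalar reward
      llm.propose_action(state, transcript) -> action   # free-form textual action
    """
    state = env.reset()
    transcript = []  # list of (state, action, reward) turns fed back to the LLM
    for _ in range(max_turns):
        action = llm.propose_action(state, transcript)  # free-form action proposal
        next_state, reward, done = env.step(action)     # environment feedback
        transcript.append((state, action, reward))
        state = next_state
        if done:
            break
    return transcript
```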
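One plausible way to implement the efficacy‑based rescaling is to compare the agent's distance to the goal before and after an action; the paper's exact efficacy metric and scaling factors may differ, so treat the numbers below as placeholders.

```python
def rescale_reward(raw_reward, dist_before, dist_after, boost=1.5, damp=0.5):
    """Scale the raw reward by action efficacy (illustrative factors).

    Assumes raw_reward is non-negative (e.g., a progress/success reward):
    actions that reduce the distance to the goal are amplified, while
    redundant or backward moves have their contribution shrunk.
    """
    if dist_after < dist_before:   # the action made measurable progress
        return raw_reward * boost
    return raw_reward * damp       # wasted move: damp its contribution
```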
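The annealed turn budget and the outer training loop could then look like the sketch below, reusing `run_episode` from the earlier snippet. `llm.ppo_update` stands in for whatever PPO‑style fine‑tuning step is used and is purely hypothetical, as is the specific cap schedule.

```python
def turn_budget(epoch, caps=(10, 8, 5, 3, 1)):
    """Interaction cap for a given training epoch; clamps to the final value."""
    return caps[min(epoch, len(caps) - 1)]

def train(llm, env, epochs=5, episodes_per_epoch=256):
    """Outer loop: tighten the interaction budget each epoch, collect
    rollouts under that budget, and fine-tune with an RL-style update."""
    for epoch in range(epochs):
        max_turns = turn_budget(epoch)          # annealed cap: 10 -> 8 -> 5 -> ...
        rollouts = [run_episode(llm, env, max_turns)
                    for _ in range(episodes_per_epoch)]
        llm.ppo_update(rollouts)                # hypothetical PPO-like update on rescaled rewards
```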
Results & Findings
| Domain | Prior multi‑turn baseline (avg. turns) | WMAct (avg. turns) | Success‑rate improvement |
|---|---|---|---|
| Sokoban | 4.7 | 1.2 | +18% |
| Maze | 6.3 | 1.0 | +22% |
| Taxi | 5.1 | 1.3 | +15% |
- Single‑turn mastery: After annealing, the model solves many instances in a single interaction, indicating that it has built an internal world model.
- Reduced redundancy: The reward‑rescaling mechanism cuts unnecessary back‑and‑forth, leading to shorter dialogues and lower compute cost.
- Cross‑domain transfer: When evaluated on unseen, larger mazes and a set of reasoning puzzles (e.g., logical deduction, spatial reasoning), WMAct‑trained models outperform baseline LLM agents by 10–12 accuracy points (absolute).
Practical Implications
- Faster agent deployment – Fewer interaction rounds mean lower latency and cheaper API usage when LLMs are used as planners for robotics, game AI, or autonomous navigation.
- Resource‑efficient fine‑tuning – The annealing schedule eliminates the need for massive multi‑turn datasets; a modest amount of interaction data suffices to teach the model the environment’s physics.
- Better generalization – By forcing the model to internalize dynamics, developers can expect more robust behavior when the environment changes slightly (e.g., new map layouts or altered reward structures).
- Plug‑and‑play – WMAct is model‑agnostic; it can be applied to any instruction‑tuned LLM (GPT‑3.5, LLaMA‑2, Claude) with minimal code changes, making it attractive for product teams building “thinking‑by‑doing” assistants.
Limitations & Future Work
- Deterministic simulators only – Experiments rely on fully deterministic environments; stochastic or partially observable worlds may need additional uncertainty handling.
- Reward design sensitivity – The efficacy of reward rescaling hinges on a well‑crafted efficacy metric; poorly chosen scaling can destabilize training.
- Scalability to high‑dimensional actions – The current setup uses discrete action spaces (move up/down/left/right). Extending WMAct to continuous control (e.g., robot arm torques) remains an open challenge.
- Future directions – The authors suggest integrating model‑based RL techniques to blend learned world models with WMAct’s interaction‑driven learning, and testing on real‑world robotics platforms where sensor noise and latency are factors.
Authors
- Bao Shu
- Yan Cai
- Jianjian Sun
- Chunrui Han
- En Yu
- Liang Zhao
- Jingcheng Hu
- Yinmin Zhang
- Haoran Lv
- Yuang Peng
- Zheng Ge
- Xiangyu Zhang
- Daxin Jiang
- Xiangyu Yue
Paper Information
- arXiv ID: 2511.23476v1
- Categories: cs.AI
- Published: November 28, 2025