[Paper] On the Limits of Innate Planning in Large Language Models
Source: arXiv - 2511.21591v1
Overview
Large language models (LLMs) have dazzled us by generating code, answering questions, and even solving puzzles, yet how well they can plan on their own remains murky. This paper puts LLMs through the classic 8‑puzzle, a task that forces a model to keep track of a mutable board state and chart a path to the goal without any external computation. The authors find that, despite careful prompting, today’s LLMs still stumble on basic planning when left to their “innate” reasoning.
Key Contributions
- Systematic evaluation of planning using the 8‑puzzle as a clean, step‑by‑step benchmark for stateful reasoning.
- Comparison of four major LLMs (including GPT‑4‑class and open‑source alternatives) across three prompting styles: Zero‑Shot, Chain‑of‑Thought (CoT), and Algorithm‑of‑Thought (AoT).
- Tiered corrective feedback experiments that let the model revise its moves after being told they’re invalid.
- Introduction of an external “move validator” that supplies only legal moves, testing whether minimal tool assistance can bridge the gap.
- Qualitative analysis pinpointing two recurring failure modes: fragile internal state representation and weak heuristic planning that leads to loops or non‑progressive moves.
Methodology
- Task selection – The 8‑puzzle (sliding tiles on a 3×3 board) was chosen because every move can be verified, the optimal solution length is known, and the problem requires explicit state tracking.
- Prompting regimes – three styles were compared:
- Zero‑Shot: a single instruction to solve the puzzle.
- Chain‑of‑Thought: the model is asked to “think out loud” and list intermediate board states.
- Algorithm‑of‑Thought: the prompt supplies a high‑level algorithmic skeleton (e.g., “while not solved, move the blank tile toward its target”).
- Feedback loops – After each generated move, the system checks validity. If the move is illegal, the model receives a corrective message and tries again, up to a fixed number of attempts (a minimal sketch of this loop follows the list).
- Move validator condition – An auxiliary module supplies only the set of legal moves for the current board, forcing the model to pick from a constrained action space.
- Metrics – Success rate (solved puzzles), average number of steps taken, and computational cost (tokens generated).
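
The paper does not include code for this setup; the following Python sketch is only an illustration of the described pipeline. It assumes a board encoded as a 9‑element tuple (0 = blank) and a hypothetical `query_llm` callable standing in for the model under test.

```python
from typing import Callable, List, Optional, Tuple

State = Tuple[int, ...]          # 9 tiles in row-major order, 0 = blank
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)
MOVES = {"up": -3, "down": 3, "left": -1, "right": 1}  # index offset of the blank


def legal_moves(state: State) -> List[str]:
    """Moves that keep the blank on the 3x3 board (the external validator)."""
    row, col = divmod(state.index(0), 3)
    moves = []
    if row > 0:
        moves.append("up")
    if row < 2:
        moves.append("down")
    if col > 0:
        moves.append("left")
    if col < 2:
        moves.append("right")
    return moves


def apply_move(state: State, move: str) -> State:
    """Swap the blank with its neighbour in the given direction."""
    blank = state.index(0)
    target = blank + MOVES[move]
    board = list(state)
    board[blank], board[target] = board[target], board[blank]
    return tuple(board)


def solve_with_feedback(
    start: State,
    query_llm: Callable[[State, Optional[str]], str],  # hypothetical model call
    max_steps: int = 50,
    max_retries: int = 3,
) -> Tuple[bool, int]:
    """Corrective-feedback loop in the spirit of the paper's setup: an illegal
    move triggers a correction message and a bounded number of retries."""
    state, steps = start, 0
    while state != GOAL and steps < max_steps:
        feedback = None
        for _ in range(max_retries):
            move = query_llm(state, feedback)
            if move in legal_moves(state):
                state = apply_move(state, move)
                steps += 1
                break
            feedback = f"'{move}' is illegal here; legal moves: {legal_moves(state)}"
        else:
            return False, steps  # the model never produced a legal move
    return state == GOAL, steps
```

Under this encoding, the move‑validator condition corresponds to handing `legal_moves(state)` to the model up front, rather than only using it to check the model’s answers after the fact.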
Results & Findings
- Baseline performance (no feedback) was low across the board: success rates hovered between 2 % and 9 % depending on model and prompting style.
- Corrective feedback boosted success for some model–prompt combinations (e.g., GPT‑4 with CoT rose to ~22 % solved), but the improvement came at the cost of many extra tokens and often involved long, indirect reasoning chains.
- Move validator condition – even when the models were handed only the legal actions for the current board, none of them solved a single puzzle. They repeated moves, entered loops, or chose actions that did not bring the board closer to the goal.
- Failure analysis revealed two dominant deficits:
- Brittle internal state – the model frequently “forgot” the current board configuration, leading to illegal moves.
- Weak heuristics – without an explicit search or distance metric, the model’s move choices were essentially random or even counter‑productive.
Practical Implications
- Tool‑augmented agents: Relying solely on an LLM’s internal reasoning for planning (e.g., autonomous agents navigating UI workflows) is risky. Adding external state trackers or search modules is essential.
- Prompt engineering limits: While CoT and AoT can coax better behavior, they cannot replace a systematic planning component. Developers should treat prompting as a guide, not a guarantee of correctness.
- Cost considerations: The token overhead of iterative feedback can quickly become prohibitive in production settings, especially for real‑time applications.
- Design of AI assistants: For tasks like code refactoring, UI automation, or game AI, integrating a lightweight planner (e.g., A* search) alongside the LLM yields more reliable outcomes than pure language‑only pipelines; a minimal sketch of such a planner follows this list.
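
To make the “lightweight planner” suggestion concrete, here is a minimal A* solver for the same 8‑puzzle using a Manhattan‑distance heuristic. This is an illustrative sketch, not code from the paper; in a tool‑augmented agent the LLM would delegate the move search to something like this and handle only the surrounding language tasks.

```python
import heapq
from typing import Dict, List, Optional, Tuple

State = Tuple[int, ...]          # 9 tiles, row-major, 0 = blank
GOAL: State = (1, 2, 3, 4, 5, 6, 7, 8, 0)


def manhattan(state: State) -> int:
    """Sum of tile distances from their goal positions (admissible heuristic)."""
    dist = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        goal_idx = tile - 1
        dist += abs(idx // 3 - goal_idx // 3) + abs(idx % 3 - goal_idx % 3)
    return dist


def neighbours(state: State) -> List[Tuple[str, State]]:
    """All boards reachable by one move of the blank."""
    blank = state.index(0)
    row, col = divmod(blank, 3)
    out = []
    for name, (dr, dc) in {"up": (-1, 0), "down": (1, 0),
                           "left": (0, -1), "right": (0, 1)}.items():
        r, c = row + dr, col + dc
        if 0 <= r < 3 and 0 <= c < 3:
            board = list(state)
            target = r * 3 + c
            board[blank], board[target] = board[target], board[blank]
            out.append((name, tuple(board)))
    return out


def astar(start: State) -> Optional[List[str]]:
    """Return an optimal move sequence from start to GOAL, or None if unreachable."""
    frontier = [(manhattan(start), 0, start, [])]
    best_cost: Dict[State, int] = {start: 0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == GOAL:
            return path
        for move, nxt in neighbours(state):
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier,
                               (new_cost + manhattan(nxt), new_cost, nxt, path + [move]))
    return None


print(astar((1, 2, 3, 4, 5, 6, 0, 7, 8)))  # e.g. ['right', 'right']
```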
Limitations & Future Work
- Scope of tasks – The study focuses on a single, well‑understood puzzle; results may differ for domains with richer state representations.
- Model selection – Only four models were examined; newer or specialized planning‑oriented LLMs could behave differently.
- Feedback depth – The corrective loop was capped at a modest number of attempts; deeper iterative refinement might improve success but at higher cost.
- Future directions suggested by the authors include:
- Embedding explicit state variables within the LLM’s context (e.g., via structured prompts or memory modules); a small illustration follows this list.
- Coupling LLMs with classical search algorithms or differentiable planners.
- Exploring multi‑modal feedback (visual board snapshots) to strengthen state grounding.
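
As a rough illustration of the first direction (not a method from the paper), the snippet below keeps an externally maintained, structured state block in the prompt each turn, so the model never has to reconstruct the board from its own earlier reasoning; the prompt wording and field names are invented.

```python
import json
from typing import List, Tuple

State = Tuple[int, ...]  # 9 tiles, row-major, 0 = blank


def state_block(state: State, legal: List[str]) -> str:
    """Serialize the authoritative board state for injection into the prompt."""
    return json.dumps(
        {
            "board": [list(state[i:i + 3]) for i in range(0, 9, 3)],
            "blank_position": divmod(state.index(0), 3),
            "legal_moves": legal,
        },
        indent=2,
    )


prompt = (
    "You are solving an 8-puzzle. The CURRENT state below is ground truth; "
    "do not infer the board from prior turns.\n"
    + state_block((1, 2, 3, 4, 0, 5, 7, 8, 6), ["up", "down", "left", "right"])
    + "\nReply with exactly one move."
)
```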
Bottom line: LLMs are impressive storytellers, but when it comes to disciplined, step‑by‑step planning without external aids, they still fall short. For developers building autonomous systems, the takeaway is clear: pair language models with dedicated planning tools to achieve robust, real‑world performance.
Authors
- Charles Schepanowski
- Charles Ling
Paper Information
- arXiv ID: 2511.21591v1
- Categories: cs.AI
- Published: November 26, 2025