[Paper] LLM CHESS: Benchmarking Reasoning and Instruction-Following in LLMs through Chess
Source: arXiv - 2512.01992v1
Overview
The paper presents LLM CHESS, a new benchmark that tests how well large language models (LLMs) can reason and follow instructions in a dynamic, interactive setting: playing chess against a random-moving opponent. By turning the classic board game into a multi-turn "agentic" task, the authors expose gaps in current models' ability to plan, stay consistent, and avoid hallucinated moves, offering a more realistic gauge of reasoning performance than static accuracy benchmarks.
Key Contributions
- A novel evaluation framework that turns chess into a step‑by‑step instruction‑following challenge for LLMs.
- Comprehensive behavioral metrics (win/loss rate, move legality, move quality, hallucinated actions, game length) that go beyond static accuracy scores.
- Leaderboard & Elo‑style scoring for over 50 open‑ and closed‑source models, enabling easy comparison with traditional chess engines.
- Evidence of a reasoning vs. non‑reasoning split among state‑of‑the‑art models, even when the opponent is deliberately weak.
- Open‑source release of the full experimental pipeline, game dataset, and evaluation scripts to foster reproducibility and future research.
Methodology
- Game Setup – Each LLM plays as White against a "random" opponent that selects one of the legal moves uniformly at random on every turn. This keeps the baseline opponent deliberately weak while still requiring the model to produce a coherent, legal sequence of moves over a whole game (a minimal game-loop sketch appears after this list).
- Prompt Design – The model receives a concise instruction (“Make your next move in standard algebraic notation”) along with the current board state expressed in Forsyth‑Edwards Notation (FEN). After each move, the board is updated and fed back to the model.
- Metric Collection – For every turn the framework logs:
- Legality – whether the suggested move is legal.
- Quality – evaluated by a Stockfish engine (depth‑2) to score move strength.
- Hallucinations – any output that does not correspond to a valid move (e.g., prose, unrelated text).
- Game Duration – number of moves before termination (win, loss, draw, or illegal move).
- Elo Estimation – For the top-performing models, the authors additionally pit them against a configurable Stockfish engine at varying skill levels and compute an Elo rating, translating raw win/loss data into a familiar competitive scale (a sketch of the Elo fit also follows the list).
- Ranking & Leaderboard – All models are ranked on a public leaderboard that aggregates the above metrics, allowing quick visual comparison.
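To make the evaluation loop concrete, here is a minimal sketch of how such a harness could be assembled with the python-chess library and a local Stockfish binary. The prompt wording, the query_llm callback, the SAN heuristic, and the Stockfish path are illustrative assumptions, not the authors' exact implementation.

```python
import random
import re

import chess
import chess.engine

STOCKFISH_PATH = "/usr/local/bin/stockfish"  # assumed path to a local Stockfish binary

# Rough shape of a SAN token, used only to separate "illegal move" from "not a move at all".
SAN_TOKEN = re.compile(r"^(O-O(-O)?|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](=[QRBN])?)[+#]?$")

def random_opponent_move(board: chess.Board) -> chess.Move:
    """Baseline opponent: choose among the legal moves uniformly at random."""
    return random.choice(list(board.legal_moves))

def build_prompt(board: chess.Board) -> str:
    """Concise instruction plus the current position in FEN (wording is illustrative)."""
    return ("Make your next move in standard algebraic notation.\n"
            f"Current position (FEN): {board.fen()}")

def classify_reply(board: chess.Board, reply: str):
    """Label the raw model output as a legal move, an illegal move, or a hallucination."""
    token = reply.strip().split()[0].rstrip(".") if reply.strip() else ""
    try:
        return "legal", board.parse_san(token)   # parse_san only accepts legal moves
    except ValueError:
        return ("illegal" if SAN_TOKEN.match(token) else "hallucination"), None

def move_quality(engine, board: chess.Board, move: chess.Move, depth: int = 2) -> int:
    """Score the chosen move with a shallow Stockfish search (centipawns, side to move)."""
    info = engine.analyse(board, chess.engine.Limit(depth=depth), root_moves=[move])
    return info["score"].relative.score(mate_score=10_000)

def play_game(query_llm, max_moves: int = 200):
    """One game of LLM (White) vs. the random mover, with per-turn metric logging."""
    board = chess.Board()
    log = []
    with chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH) as engine:
        while not board.is_game_over() and board.fullmove_number <= max_moves:
            reply = query_llm(build_prompt(board))       # query_llm is assumed, not shown
            label, move = classify_reply(board, reply)
            record = {"turn": board.fullmove_number, "label": label, "raw": reply}
            log.append(record)
            if label != "legal":
                break                                    # illegal move or hallucination ends the game
            record["quality_cp"] = move_quality(engine, board, move)
            board.push(move)
            if not board.is_game_over():
                board.push(random_opponent_move(board))  # opponent replies uniformly at random
    return board.result(claim_draw=True), log
```

A production harness would add retries, timeouts, and persistent logging, but the control flow mirrors the loop described above.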
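The Elo estimate amounts to inverting the standard expected-score formula, E = 1 / (1 + 10^((R_opp - R) / 400)), given results against engines of known strength. The grid-search fit below is a hedged sketch of one way to do this; the reference ratings and scores in the example are made up, and the paper's exact fitting procedure may differ.

```python
# Back out a single Elo rating from game scores against opponents of known Elo,
# assuming the standard logistic model E = 1 / (1 + 10 ** ((R_opp - R) / 400)).

def expected_score(rating: float, opponent: float) -> float:
    """Expected score (0..1) of `rating` against `opponent` under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))

def estimate_elo(results, lo: float = 0.0, hi: float = 3000.0, step: float = 1.0) -> float:
    """results: iterable of (opponent_elo, score) pairs, score 1 = win, 0.5 = draw, 0 = loss.
    Returns the rating minimizing squared error between expected and observed scores."""
    results = list(results)
    best_rating, best_err = lo, float("inf")
    rating = lo
    while rating <= hi:
        err = sum((expected_score(rating, opp) - s) ** 2 for opp, s in results)
        if err < best_err:
            best_rating, best_err = rating, err
        rating += step
    return best_rating

# Made-up example: mixed results against Stockfish limited to roughly 1400 and 1700 Elo.
games = [(1400, 1), (1400, 1), (1400, 0.5), (1700, 0), (1700, 0.5), (1700, 0)]
print(round(estimate_elo(games)))  # lands between the two reference ratings
```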
Results & Findings
- Wide performance spread – Among 50+ models, only a handful (e.g., GPT‑4, Claude‑2, LLaMA‑2‑70B‑Chat) consistently produce legal moves and achieve positive win rates.
- Reasoning models excel – Models explicitly trained for chain-of-thought reasoning or tool use (i.e., dedicated "reasoning" models) outperform vanilla instruction-following models, confirming the benchmark's sensitivity to reasoning capability.
- Hallucination persists – Even top models occasionally output non‑move text, causing premature game termination.
- Elo scores reveal gaps – The best LLMs sit around 1500–1700 Elo, comparable to a low‑intermediate human player, while many models fall below 1000, indicating frequent illegal or nonsensical moves.
- Dynamic nature prevents overfitting – Because each game evolves based on the model’s prior actions, memorizing a static dataset is ineffective, and performance does not saturate quickly.
Practical Implications
- Tool‑augmented agents – Developers building LLM‑powered assistants that must plan multi‑step actions (e.g., code generation pipelines, autonomous bots) can use LLM CHESS as a proxy to gauge how well their models will handle sequential decision making.
- Safety & reliability checks – The hallucination metric highlights failure modes that could translate into real-world risks, such as issuing invalid API calls. Integrating similar validity checks before executing model output can improve system robustness (see the sketch after this list).
- Benchmark for fine-tuning – The open framework lets practitioners fine-tune models on the chess interaction loop, potentially yielding planning abilities that carry over to other domains such as workflow automation or strategic game AI.
- Elo‑style reporting – Translating LLM performance into Elo scores gives product managers an intuitive way to communicate model competence to stakeholders, similar to how AI‑driven game bots are evaluated.
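As one illustration of the kind of guard the hallucination metric motivates, the snippet below validates a model-proposed action against an explicit whitelist before executing it, treating anything else the way the benchmark treats a non-move reply. The action names and command format are hypothetical placeholders, not part of the paper.

```python
# Hypothetical guard: execute a model-proposed action only if it parses into a known,
# whitelisted command; otherwise reject it, the way LLM CHESS rejects a non-move reply.

ALLOWED_ACTIONS = {"create_ticket", "close_ticket", "add_comment"}  # example whitelist

def validate_action(raw_output: str):
    """Return (action, argument) for a well-formed, whitelisted command, else None."""
    parts = raw_output.strip().split(maxsplit=1)
    if not parts or parts[0] not in ALLOWED_ACTIONS:
        return None  # prose, typo, or unknown verb: reject rather than execute
    return parts[0], (parts[1] if len(parts) > 1 else "")

for reply in ["add_comment build is green", "Sure! I think we should escalate this."]:
    print(repr(reply), "->", validate_action(reply))
```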
Limitations & Future Work
- Simplistic opponent – Using a random mover may not fully stress a model’s strategic depth; stronger opponents could expose additional weaknesses.
- Domain specificity – Chess rules are well‑defined; transferring insights to less formal domains (e.g., natural language planning) may require additional validation.
- Compute cost – Running many games for large models is resource‑intensive, limiting rapid iteration for smaller teams.
- Future directions – The authors suggest extending the benchmark to other turn‑based environments (e.g., Go, real‑time strategy games) and incorporating tool‑use (e.g., invoking external engines) to study hybrid reasoning pipelines.
Authors
- Sai Kolasani
- Maxim Saplin
- Nicholas Crispino
- Kyle Montgomery
- Jared Quincy Davis
- Matei Zaharia
- Chi Wang
- Chenguang Wang
Paper Information
- arXiv ID: 2512.01992v1
- Categories: cs.AI, cs.CL
- Published: December 1, 2025