[Paper] ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Source: arXiv - 2512.07795v1
Overview
Large language models (LLMs) are now being used for tasks that require step‑by‑step reasoning—think chain‑of‑thought prompts, math problem solving, or code generation. The new ReasonBENCH benchmark shines a light on a hidden problem: most papers report a single‑run accuracy, ignoring the fact that stochastic decoding can make the same prompt produce wildly different answers on different runs. This work quantifies that instability and offers a reproducible, variance‑aware evaluation framework for the community.
Key Contributions
- First dedicated instability benchmark for LLM reasoning, covering multiple domains (math, commonsense, code, etc.).
- Modular evaluation library that standardizes reasoning frameworks, model APIs, and task formats, making it easy to plug in new prompts or models (a hypothetical interface sketch appears after this list).
- Multi‑run protocol that automatically runs each experiment many times, computes confidence intervals, and reports cost‑adjusted metrics (tokens, latency, API price).
- Public leaderboard that displays both mean performance and variability, encouraging researchers to publish variance‑aware results.
- Empirical analysis showing that most reasoning strategies exhibit high variance; some methods with identical average scores differ by up to 4× in confidence‑interval width, and the highest‑scoring methods often have the least stable cost profiles.
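Neither this summary nor the abstract spells out the library's actual API, so the following is only a hypothetical Python sketch of what such a plug-in design could look like: a minimal model/task interface that new prompts or models would implement. Every class and method name here (Completion, Model, Task, generate, format_prompt, score) is an illustrative assumption, not the ReasonBENCH interface itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Completion:
    """One sampled generation plus the cost metadata the benchmark tracks."""
    text: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float


class Model(ABC):
    """Hypothetical adapter around any model API (names are assumptions)."""

    @abstractmethod
    def generate(self, prompt: str, temperature: float = 0.7, top_p: float = 0.95) -> Completion: ...


class Task(ABC):
    """Hypothetical task wrapper: supplies examples, formats prompts, scores outputs."""

    @abstractmethod
    def examples(self) -> list[dict]: ...  # each dict: {"question": ..., "answer": ...}

    @abstractmethod
    def format_prompt(self, example: dict, style: str) -> str: ...  # style: "plain", "cot", ...

    @abstractmethod
    def score(self, output: str, example: dict) -> bool: ...  # exact match / solve success
```

Under this kind of contract, adding a new model or task means implementing one small class rather than touching the evaluation loop.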
Methodology
- Task Suite – ReasonBENCH bundles a curated set of reasoning tasks (e.g., GSM‑8K math, ARC‑Easy, CodeEval) that require multi‑step inference.
- Standardized Prompt Templates – For each task, the library provides a collection of prompting styles (plain, chain‑of‑thought, self‑consistency, etc.) so that comparisons are apples‑to‑apples.
- Multi‑Run Execution – Every prompt‑model‑task combination is executed N times (default N = 30) using stochastic decoding settings (temperature > 0, top‑p sampling); a minimal code sketch of this loop appears after this list.
- Statistical Reporting – The framework aggregates raw outputs into:
  - Mean solve rate (accuracy or exact match).
  - 95% confidence interval for the solve rate, derived from the empirical distribution.
  - Cost statistics (average token usage, API price, latency) with corresponding variance.
- Leaderboard Integration – Results are automatically pushed to a public leaderboard that visualizes both central tendency and spread, making instability a first‑class metric.
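To make the multi-run protocol concrete, here is a minimal Python sketch (built on the hypothetical interfaces above, not the actual ReasonBENCH code) that runs one prompt-model-task configuration N times with stochastic decoding, then reports the mean solve rate, a bootstrap 95% confidence interval, and token-cost statistics.

```python
import random
import statistics


def evaluate_config(model, task, style: str, n_runs: int = 30, seed: int = 0) -> dict:
    """Run one prompt-model-task configuration n_runs times and summarize the spread."""
    rng = random.Random(seed)
    solve_rates, token_costs = [], []
    examples = task.examples()

    for _ in range(n_runs):
        solved, tokens = 0, 0
        for ex in examples:
            prompt = task.format_prompt(ex, style)
            out = model.generate(prompt, temperature=0.7, top_p=0.95)
            solved += task.score(out.text, ex)
            tokens += out.prompt_tokens + out.completion_tokens
        solve_rates.append(solved / len(examples))
        token_costs.append(tokens)

    # Bootstrap 95% confidence interval over the per-run solve rates.
    n_boot = 2000
    boot_means = sorted(
        statistics.mean(rng.choices(solve_rates, k=n_runs)) for _ in range(n_boot)
    )
    ci_low = boot_means[int(0.025 * n_boot)]
    ci_high = boot_means[int(0.975 * n_boot)]

    return {
        "mean_solve_rate": statistics.mean(solve_rates),
        "ci95": (ci_low, ci_high),
        "mean_tokens": statistics.mean(token_costs),
        "token_std": statistics.stdev(token_costs),
        "p95_tokens": sorted(token_costs)[int(0.95 * (n_runs - 1))],  # rough worst-case budget
    }
```

The p95 token figure is one simple way to surface the worst-case budgeting concern raised under Practical Implications below, alongside the mean and variance the benchmark reports.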
Results & Findings
- Ubiquitous Instability – Over 85% of model‑prompt pairs show confidence intervals wider than 5% of the mean accuracy, even on well‑studied benchmarks like GSM‑8K.
- Trade‑off Between Performance and Stability – The top‑performing chain‑of‑thought + self‑consistency setups achieve the highest average scores but also exhibit the widest confidence intervals and the most variable token costs.
- Prompt Sensitivity – Small wording changes (e.g., “Let’s think step‑by‑step” vs. “First, consider”) can swing the variance by up to 2×, highlighting the need for prompt‑level robustness checks.
- Model Scale Effects – Larger models (e.g., GPT‑4‑level) tend to be more stable than smaller ones, but the improvement is not linear; some mid‑size models (e.g., LLaMA‑13B) are surprisingly erratic under the same decoding settings.
- Cost Instability – Methods that invoke multiple sampling passes (e.g., self‑consistency) can double average token usage, and their cost variance can be four times that of single‑pass baselines.
Practical Implications
- Production‑Ready Deployments – Engineers should treat LLM reasoning outputs as probabilistic, not deterministic. Sampling a few generations and aggregating them (e.g., by majority vote) can dramatically reduce failure rates; see the sketch after this list.
- Budget Forecasting – Because cost variance can be large, teams need to budget for worst‑case token usage rather than relying on mean estimates. ReasonBENCH’s cost‑aware metrics make this budgeting more transparent.
- Prompt Engineering Pipelines – Automated prompt‑tuning should incorporate variance as an objective, not just mean accuracy. This leads to prompts that are both high‑performing and reliable across runs.
- Model Selection – When choosing a model for a reasoning‑heavy product, consider the stability profile: a slightly lower‑accuracy but more stable model may yield a better user experience and lower operational costs.
- Benchmarking Culture Shift – By publishing confidence intervals alongside scores, the community can better assess reproducibility, reduce “cherry‑picked” results, and accelerate the development of uncertainty‑aware reasoning methods.
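As a concrete illustration of the "sample a few generations and aggregate" advice above, here is a minimal, hypothetical Python sketch of majority voting over k sampled answers. The extract_answer callback stands in for a task-specific answer parser and, like the model.generate signature, is an assumption rather than anything specified in the paper.

```python
from collections import Counter


def majority_vote_answer(model, prompt: str, extract_answer, k: int = 5) -> str:
    """Sample k stochastic generations and return the most common extracted answer."""
    answers = []
    for _ in range(k):
        out = model.generate(prompt, temperature=0.7, top_p=0.95)
        answers.append(extract_answer(out.text))
    # Ties are broken arbitrarily by Counter; a production system might instead
    # request more samples or fall back to a low-temperature rerun.
    return Counter(answers).most_common(1)[0][0]
```

The trade-off is exactly the one the benchmark measures: k-fold sampling multiplies token cost (and its variance) in exchange for a more stable final answer.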
Limitations & Future Work
- Decoding Settings Only – The benchmark focuses on temperature‑based stochastic decoding; deterministic decoding (e.g., greedy) and alternative sampling strategies (e.g., nucleus vs. top‑k) deserve separate study.
- Task Coverage – While ReasonBENCH spans several domains, it does not yet include long‑form reasoning (e.g., legal analysis) or multimodal tasks that combine text and images.
- Scalability of Multi‑Run Experiments – Running 30+ samples per configuration can be costly for large commercial APIs; future work could explore variance estimation with fewer runs or adaptive sampling.
- Uncertainty Quantification Techniques – The authors provide the benchmark but leave the development of model‑level uncertainty estimators (e.g., Bayesian LLMs) as an open research direction.
ReasonBENCH opens the door to a more honest, reproducible evaluation of LLM reasoning—something that developers, product teams, and researchers alike can start using today to build more reliable AI systems.
Authors
- Nearchos Potamitis
- Lars Klein
- Akhil Arora
Paper Information
- arXiv ID: 2512.07795v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: December 8, 2025