[Paper] ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning
Source: arXiv - 2512.07795v1
Overview
Large language models (LLMs) are now being used for tasks that require step‑by‑step reasoning—think chain‑of‑thought prompts, math problem solving, or code generation. The new ReasonBENCH benchmark shines a light on a hidden problem: most papers report a single‑run accuracy, ignoring the fact that stochastic decoding can make the same prompt produce wildly different answers on different runs. This work quantifies that instability and offers a reproducible, variance‑aware evaluation framework for the community.
Key Contributions
- First dedicated instability benchmark for LLM reasoning, covering multiple domains (math, commonsense, code, etc.).
- Modular evaluation library that standardizes reasoning frameworks, model APIs, and task formats, making it easy to plug in new prompts or models (a hypothetical interface sketch appears after this list).
- Multi‑run protocol that automatically runs each experiment many times, computes confidence intervals, and reports cost‑adjusted metrics (tokens, latency, API price).
- Public leaderboard that displays both mean performance and variability, encouraging researchers to publish variance‑aware results.
- Empirical analysis showing that most reasoning strategies exhibit high variance; some methods with identical average scores differ by up to 4× in confidence‑interval width, and the highest‑scoring methods often have the least stable cost profiles.
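Neither this summary nor the abstract spells out the library's actual API, so the following is only a hypothetical Python sketch of what such a plug-in design could look like: a minimal model/task interface that new prompts or models would implement. Every class and method name here (Completion, Model, Task, generate, format_prompt, score) is an illustrative assumption, not the ReasonBENCH interface itself.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class Completion:
    """One sampled generation plus the cost metadata the benchmark tracks."""
    text: str
    prompt_tokens: int
    completion_tokens: int
    latency_s: float


class Model(ABC):
    """Hypothetical adapter around any model API (names are assumptions)."""

    @abstractmethod
    def generate(self, prompt: str, temperature: float = 0.7, top_p: float = 0.95) -> Completion: ...


class Task(ABC):
    """Hypothetical task wrapper: supplies examples, formats prompts, scores outputs."""

    @abstractmethod
    def examples(self) -> list[dict]: ...  # each dict: {"question": ..., "answer": ...}

    @abstractmethod
    def format_prompt(self, example: dict, style: str) -> str: ...  # style: "plain", "cot", ...

    @abstractmethod
    def score(self, output: str, example: dict) -> bool: ...  # exact match / solve success
```

Under this kind of contract, adding a new model or task means implementing one small class rather than touching the evaluation loop.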
Methodology
- Task Suite – ReasonBENCH bundles a curated set of reasoning tasks (e.g., GSM‑8K math, ARC‑Easy, CodeEval) that require multi‑step inference.
- Standardized Prompt Templates – For each task, the library provides a collection of prompting styles (plain, chain‑of‑thought, self‑consistency, etc.) so that comparisons are apples‑to‑apples.
- Multi‑Run Execution – Every prompt‑model‑task combination is executed N times (default N = 30) using stochastic decoding settings (temperature > 0, top‑p sampling); a minimal code sketch of this loop appears after this list.
- Statistical Reporting – The framework aggregates raw outputs into:
  - Mean solve rate (accuracy or exact match).
  - 95% confidence interval for the solve rate, derived from the empirical distribution.
  - Cost statistics (average token usage, API price, latency) with corresponding variance.
- Leaderboard Integration – Results are automatically pushed to a public leaderboard that visualizes both central tendency and spread, making instability a first‑class metric.
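To make the multi-run protocol concrete, here is a minimal Python sketch (built on the hypothetical interfaces above, not the actual ReasonBENCH code) that runs one prompt-model-task configuration N times with stochastic decoding, then reports the mean solve rate, a bootstrap 95% confidence interval, and token-cost statistics.

```python
import random
import statistics


def evaluate_config(model, task, style: str, n_runs: int = 30, seed: int = 0) -> dict:
    """Run one prompt-model-task configuration n_runs times and summarize the spread."""
    rng = random.Random(seed)
    solve_rates, token_costs = [], []
    examples = task.examples()

    for _ in range(n_runs):
        solved, tokens = 0, 0
        for ex in examples:
            prompt = task.format_prompt(ex, style)
            out = model.generate(prompt, temperature=0.7, top_p=0.95)
            solved += task.score(out.text, ex)
            tokens += out.prompt_tokens + out.completion_tokens
        solve_rates.append(solved / len(examples))
        token_costs.append(tokens)

    # Bootstrap 95% confidence interval over the per-run solve rates.
    n_boot = 2000
    boot_means = sorted(
        statistics.mean(rng.choices(solve_rates, k=n_runs)) for _ in range(n_boot)
    )
    ci_low = boot_means[int(0.025 * n_boot)]
    ci_high = boot_means[int(0.975 * n_boot)]

    return {
        "mean_solve_rate": statistics.mean(solve_rates),
        "ci95": (ci_low, ci_high),
        "mean_tokens": statistics.mean(token_costs),
        "token_std": statistics.stdev(token_costs),
        "p95_tokens": sorted(token_costs)[int(0.95 * (n_runs - 1))],  # rough worst-case budget
    }
```

The p95 token figure is one simple way to surface the worst-case budgeting concern raised under Practical Implications below, alongside the mean and variance the benchmark reports.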
Results & Findings
- Ubiquitous Instability – Over 85% of model‑prompt pairs show confidence intervals wider than 5% of the mean accuracy, even on well‑studied benchmarks like GSM‑8K.
- Trade‑off Between Performance and Stability – The top‑performing chain‑of‑thought + self‑consistency setups achieve the highest average scores but also exhibit the widest confidence intervals and the most variable token costs.
- Prompt Sensitivity – Small wording changes (e.g., “Let’s think step‑by‑step” vs. “First, consider”) can swing the variance by up to 2×, highlighting the need for prompt‑level robustness checks.
- Model Scale Effects – Larger models (e.g., GPT‑4‑level) tend to be more stable than smaller ones, but the improvement is not linear; some mid‑size models (e.g., LLaMA‑13B) are surprisingly erratic under the same decoding settings.
- Cost Instability – Methods that invoke multiple sampling passes (e.g., self‑consistency) can double average token usage, and their cost variance can be four times that of single‑pass baselines.
Practical Implications
- Production‑Ready Deployments – Engineers should treat LLM reasoning outputs as probabilistic, not deterministic. Sampling a few generations and aggregating them (e.g., by majority vote) can dramatically reduce failure rates; see the sketch after this list.
- Budget Forecasting – Because cost variance can be large, teams need to budget for worst‑case token usage rather than relying on mean estimates. ReasonBENCH’s cost‑aware metrics make this budgeting more transparent.
- Prompt Engineering Pipelines – Automated prompt‑tuning should incorporate variance as an objective, not just mean accuracy. This leads to prompts that are both high‑performing and reliable across runs.
- Model Selection – When choosing a model for a reasoning‑heavy product, consider the stability profile: a slightly lower‑accuracy but more stable model may yield a better user experience and lower operational costs.
- Benchmarking Culture Shift – By publishing confidence intervals alongside scores, the community can better assess reproducibility, reduce “cherry‑picked” results, and accelerate the development of uncertainty‑aware reasoning methods.
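As a concrete illustration of the "sample a few generations and aggregate" advice above, here is a minimal, hypothetical Python sketch of majority voting over k sampled answers. The extract_answer callback stands in for a task-specific answer parser and, like the model.generate signature, is an assumption rather than anything specified in the paper.

```python
from collections import Counter


def majority_vote_answer(model, prompt: str, extract_answer, k: int = 5) -> str:
    """Sample k stochastic generations and return the most common extracted answer."""
    answers = []
    for _ in range(k):
        out = model.generate(prompt, temperature=0.7, top_p=0.95)
        answers.append(extract_answer(out.text))
    # Ties are broken arbitrarily by Counter; a production system might instead
    # request more samples or fall back to a low-temperature rerun.
    return Counter(answers).most_common(1)[0][0]
```

The trade-off is exactly the one the benchmark measures: k-fold sampling multiplies token cost (and its variance) in exchange for a more stable final answer.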
Limitations & Future Work
- Decoding Settings Only – The benchmark focuses on temperature‑based stochastic decoding; deterministic decoding (e.g., greedy) and alternative sampling strategies (e.g., nucleus vs. top‑k) deserve separate study.
- Task Coverage – While ReasonBENCH spans several domains, it does not yet include long‑form reasoning (e.g., legal analysis) or multimodal tasks that combine text and images.
- Scalability of Multi‑Run Experiments – Running 30+ samples per configuration can be costly for large commercial APIs; future work could explore variance estimation with fewer runs or adaptive sampling.
- Uncertainty Quantification Techniques – The authors provide the benchmark but leave the development of model‑level uncertainty estimators (e.g., Bayesian LLMs) as an open research direction.
ReasonBENCH opens the door to a more honest, reproducible evaluation of LLM reasoning—something that developers, product teams, and researchers alike can start using today to build more reliable AI systems.
Authors
- Nearchos Potamitis
- Lars Klein
- Akhil Arora
Paper Information
- arXiv ID: 2512.07795v1
- Categories: cs.AI, cs.CL, cs.LG
- Published: December 8, 2025