[Paper] ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

Published: December 8, 2025 at 01:26 PM EST
4 min read

Source: arXiv - 2512.07795v1

Overview

Large language models (LLMs) are now being used for tasks that require step‑by‑step reasoning—think chain‑of‑thought prompts, math problem solving, or code generation. The new ReasonBENCH benchmark shines a light on a hidden problem: most papers report a single‑run accuracy, ignoring the fact that stochastic decoding can make the same prompt produce wildly different answers on different runs. This work quantifies that instability and offers a reproducible, variance‑aware evaluation framework for the community.

Key Contributions

  • First dedicated instability benchmark for LLM reasoning, covering multiple domains (math, commonsense, code, etc.).
  • Modular evaluation library that standardizes reasoning frameworks, model APIs, and task formats, making it easy to plug in new prompts or models.
  • Multi‑run protocol that automatically runs each experiment many times, computes confidence intervals, and reports cost‑adjusted metrics (tokens, latency, API price).
  • Public leaderboard that displays both mean performance and variability, encouraging researchers to publish variance‑aware results (an illustrative record format is sketched after this list).
  • Empirical analysis showing that most reasoning strategies exhibit high variance; some methods with identical average scores differ by up to 4× in confidence‑interval width, and the highest‑scoring methods often have the least stable cost profiles.
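
To make variance‑aware reporting concrete, the sketch below shows what a leaderboard record carrying both central tendency and spread could look like. The class and field names here are illustrative assumptions for this post, not ReasonBENCH's actual schema.

```python
from dataclasses import dataclass

@dataclass
class VarianceAwareResult:
    """Illustrative leaderboard record: central tendency plus spread.

    Field names are assumptions for this sketch, not ReasonBENCH's schema.
    """
    model: str
    prompt_style: str        # e.g. "cot", "self-consistency"
    task: str                # e.g. "gsm8k"
    n_runs: int              # number of repeated stochastic runs
    mean_solve_rate: float   # mean accuracy / exact match over runs
    ci_low: float            # lower bound of the 95% confidence interval
    ci_high: float           # upper bound of the 95% confidence interval
    mean_tokens: float       # average token usage per run
    tokens_std: float        # spread of token usage across runs
    mean_latency_s: float    # average wall-clock latency per run

    @property
    def ci_width(self) -> float:
        """Width of the confidence interval: a direct instability signal."""
        return self.ci_high - self.ci_low
```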

Methodology

  1. Task Suite – ReasonBENCH bundles a curated set of reasoning tasks (e.g., GSM‑8K math, ARC‑Easy, CodeEval) that require multi‑step inference.
  2. Standardized Prompt Templates – For each task, the library provides a collection of prompting styles (plain, chain‑of‑thought, self‑consistency, etc.) so that comparisons are apples‑to‑apples.
  3. Multi‑Run Execution – Every prompt‑model‑task combination is executed N times (default N = 30) using stochastic decoding settings (temperature > 0, top‑p sampling).
  4. Statistical Reporting – The framework aggregates raw outputs (see the sketch after this list) into:
    • Mean solve rate (accuracy or exact match).
    • 95 % confidence interval for the solve rate, derived from the empirical distribution.
    • Cost statistics (average token usage, API price, latency) with corresponding variance.
  5. Leaderboard Integration – Results are automatically pushed to a public leaderboard that visualizes both central tendency and spread, making instability a first‑class metric.
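
A minimal sketch of steps 3–4, assuming a placeholder run_once callable that performs a single stochastic decoding pass and returns whether the answer was correct along with its token cost. The percentile bootstrap is one common way to obtain an empirical 95 % confidence interval; the paper may use a different estimator.

```python
import random
import statistics
from typing import Callable

def evaluate_config(run_once: Callable[[], tuple[bool, int]],
                    n_runs: int = 30,
                    n_boot: int = 2000,
                    seed: int = 0) -> dict:
    """Run one prompt-model-task configuration n_runs times and report
    the mean solve rate, a bootstrap 95% CI, and cost statistics.

    `run_once` is a placeholder: it should perform a single stochastic
    decoding pass (temperature > 0) and return (is_correct, tokens_used).
    """
    rng = random.Random(seed)
    outcomes, costs = [], []
    for _ in range(n_runs):
        correct, tokens = run_once()
        outcomes.append(1.0 if correct else 0.0)
        costs.append(tokens)

    mean_solve = statistics.mean(outcomes)

    # Percentile bootstrap over the per-run outcomes for the 95% CI.
    boot_means = []
    for _ in range(n_boot):
        resample = [rng.choice(outcomes) for _ in range(n_runs)]
        boot_means.append(statistics.mean(resample))
    boot_means.sort()
    ci_low = boot_means[int(0.025 * n_boot)]
    ci_high = boot_means[int(0.975 * n_boot)]

    return {
        "mean_solve_rate": mean_solve,
        "ci_95": (ci_low, ci_high),
        "mean_tokens": statistics.mean(costs),
        "tokens_std": statistics.stdev(costs) if len(costs) > 1 else 0.0,
    }
```

With the default N = 30, the interval width directly reflects the run‑to‑run spread that the leaderboard surfaces next to the mean.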

Results & Findings

  • Ubiquitous Instability – Over 85 % of model‑prompt pairs show confidence intervals larger than 5 % of the mean accuracy, even on well‑studied benchmarks like GSM‑8K.
  • Trade‑off Between Performance and Stability – The top‑performing chain‑of‑thought + self‑consistency setups achieve the highest average scores but also exhibit the widest confidence intervals and the most variable token costs.
  • Prompt Sensitivity – Small wording changes (e.g., “Let’s think step‑by‑step” vs. “First, consider”) can swing the variance by up to 2×, highlighting the need for prompt‑level robustness checks.
  • Model Scale Effects – Larger models (e.g., GPT‑4‑level) tend to be more stable than smaller ones, but the improvement is not linear; some mid‑size models (e.g., LLaMA‑13B) are surprisingly erratic under the same decoding settings.
  • Cost Instability – Methods that invoke multiple sampling passes (self‑consistency) can double the average token usage, and the variance in cost can be four times that of single‑pass baselines.

Practical Implications

  • Production‑Ready Deployments – Engineers should treat LLM reasoning outputs as probabilistic, not deterministic. Running a few sampled generations and aggregating them (e.g., by majority vote, as sketched after this list) can dramatically reduce failure rates.
  • Budget Forecasting – Because cost variance can be large, teams need to budget for worst‑case token usage rather than relying on mean estimates. ReasonBENCH’s cost‑aware metrics make this budgeting more transparent.
  • Prompt Engineering Pipelines – Automated prompt‑tuning should incorporate variance as an objective, not just mean accuracy. This leads to prompts that are both high‑performing and reliable across runs.
  • Model Selection – When choosing a model for a reasoning‑heavy product, consider the stability profile: a slightly lower‑accuracy but more stable model may yield a better user experience and lower operational costs.
  • Benchmarking Culture Shift – By publishing confidence intervals alongside scores, the community can better assess reproducibility, reduce “cherry‑picked” results, and accelerate the development of uncertainty‑aware reasoning methods.
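
To illustrate the aggregation idea from the first bullet, here is a minimal majority‑vote sketch; generate_answer is a hypothetical stand‑in for whatever model call a deployment already makes.

```python
from collections import Counter
from typing import Callable

def majority_vote_answer(generate_answer: Callable[[str], str],
                         prompt: str,
                         n_samples: int = 5) -> tuple[str, float]:
    """Sample n_samples stochastic generations and return the most
    common final answer together with its agreement rate.

    A low agreement rate is a cheap runtime signal that the prompt sits
    in the unstable regime ReasonBENCH measures offline.
    """
    answers = [generate_answer(prompt) for _ in range(n_samples)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n_samples
```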

Limitations & Future Work

  • Decoding Settings Only – The benchmark focuses on temperature‑based stochastic decoding; deterministic decoding (e.g., greedy) and alternative sampling strategies (e.g., nucleus vs. top‑k) deserve separate study.
  • Task Coverage – While ReasonBENCH spans several domains, it does not yet include long‑form reasoning (e.g., legal analysis) or multimodal tasks that combine text and images.
  • Scalability of Multi‑Run Experiments – Running 30+ samples per configuration can be costly for large commercial APIs; future work could explore variance estimation with fewer runs or adaptive sampling (a speculative sketch follows this list).
  • Uncertainty Quantification Techniques – The authors provide the benchmark but leave the development of model‑level uncertainty estimators (e.g., Bayesian LLMs) as an open research direction.
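
As a speculative illustration of the adaptive‑sampling direction mentioned above (not something the paper specifies), the sketch below keeps re‑running a configuration only until a normal‑approximation 95 % interval on the solve rate is tight enough.

```python
import math
import statistics
from typing import Callable

def adaptive_solve_rate(run_once: Callable[[], bool],
                        min_runs: int = 5,
                        max_runs: int = 30,
                        target_half_width: float = 0.05) -> tuple[float, int]:
    """Keep sampling until the approximate 95% CI half-width on the solve
    rate drops below target_half_width, or max_runs is reached.

    Uses a normal approximation (1.96 * standard error); a placeholder
    stopping rule, not the paper's method.
    """
    outcomes = []
    while len(outcomes) < max_runs:
        outcomes.append(1.0 if run_once() else 0.0)
        if len(outcomes) >= min_runs:
            stderr = statistics.stdev(outcomes) / math.sqrt(len(outcomes))
            if 1.96 * stderr < target_half_width:
                break
    return statistics.mean(outcomes), len(outcomes)
```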

ReasonBENCH opens the door to a more honest, reproducible evaluation of LLM reasoning—something that developers, product teams, and researchers alike can start using today to build more reliable AI systems.

Authors

  • Nearchos Potamitis
  • Lars Klein
  • Akhil Arora

Paper Information

  • arXiv ID: 2512.07795v1
  • Categories: cs.AI, cs.CL, cs.LG
  • Published: December 8, 2025