[Paper] Pessimistic Verification for Open Ended Math Questions

Published: November 26, 2025 at 10:52 AM EST
4 min read
Source: arXiv - 2511.21522v1

Overview

The paper introduces pessimistic verification, a lightweight yet powerful technique for checking the correctness of open‑ended math solutions generated by large language models (LLMs). By running several independent verification passes in parallel and flagging a proof as wrong if any pass spots an error, the authors achieve a noticeable boost in verification accuracy without a hefty increase in compute cost.

Key Contributions

  • Pessimistic verification framework: A simple workflow that aggregates multiple parallel verification attempts, treating a single failure as a definitive error signal.
  • Empirical gains across benchmarks: Demonstrates consistent improvements on a suite of math verification datasets, often surpassing more compute‑intensive baselines such as extended chain‑of‑thought (CoT) prompting.
  • Token‑efficiency analysis: Shows that the method achieves higher verification performance per token, making it attractive for real‑time or resource‑constrained deployments.
  • Error‑annotation insight: Reveals that many false negatives reported by stronger models stem from mislabeled ground‑truth data, suggesting that pessimistic verification may be even more effective than reported.
  • Scalable self‑verification pipeline: Provides a recipe for integrating pessimistic verification into existing LLM pipelines for long‑horizon mathematical reasoning tasks.

Methodology

  1. Generate a candidate proof – The base LLM (e.g., GPT‑4, Claude) solves a math problem and outputs a step‑by‑step solution.
  2. Spawn parallel verifiers – The same or different verifier models are prompted to check the proof. Each verifier runs independently, using a standard “self‑check” prompt (e.g., “Is there any mistake in the reasoning above?”).
  3. Pessimistic aggregation – If any verifier returns “incorrect” or highlights a flaw, the overall system marks the proof as invalid; otherwise, it is accepted as correct (formalized right after this list).
  4. Optional fallback – When a proof is rejected, the system can trigger a regeneration step or request a more detailed justification from the original solver.
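
The aggregation rule in step 3 can be written compactly. In the notation below (ours, not the paper's), v_i is 1 if verifier i finds no error and 0 otherwise:

```latex
% Pessimistic aggregation over k parallel verification passes:
% accept the proof only if every pass finds no error.
\hat{y} \;=\; \bigwedge_{i=1}^{k} v_i \;=\; \prod_{i=1}^{k} v_i, \qquad v_i \in \{0, 1\}
```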

The approach requires no architectural changes to the underlying LLMs; it is purely a prompting and orchestration strategy that can be layered onto existing pipelines.
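
As a concrete illustration, here is a minimal Python sketch of that workflow. The `solve` and `check_once` functions are hypothetical placeholders for real solver and verifier API calls, and the retry budget is our choice, not a detail from the paper:

```python
from concurrent.futures import ThreadPoolExecutor

NUM_VERIFIERS = 3   # the paper reports diminishing returns beyond 3-4 passes
MAX_RETRIES = 2     # optional fallback: regenerate a rejected proof (our choice)

def solve(problem: str) -> str:
    """Placeholder for the solver LLM call that returns a step-by-step proof."""
    raise NotImplementedError("replace with a real model API call")

def check_once(problem: str, proof: str) -> bool:
    """Placeholder for one independent verification pass.

    Prompts a verifier model with something like "Is there any mistake in
    the reasoning above?" and returns True if no mistake is reported."""
    raise NotImplementedError("replace with a real model API call")

def pessimistic_verify(problem: str, proof: str, k: int = NUM_VERIFIERS) -> bool:
    """Accept the proof only if all k parallel passes find no error."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        verdicts = pool.map(lambda _: check_once(problem, proof), range(k))
        return all(verdicts)  # a single "incorrect" verdict vetoes the proof

def solve_with_verification(problem: str) -> str | None:
    """Generate, verify pessimistically, and regenerate on rejection."""
    for _ in range(1 + MAX_RETRIES):
        proof = solve(problem)
        if pessimistic_verify(problem, proof):
            return proof
    return None  # no proof survived verification within the retry budget
```

Because the k passes are independent, they can run concurrently, so the added wall-clock cost is roughly one verification call rather than k sequential ones.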

Results & Findings

| Benchmark | Single-pass baseline accuracy | Pessimistic verification accuracy (3 passes) | Gain |
|---|---|---|---|
| MATH (OpenAI) | 71.2% | 78.5% | +7.3 pts |
| GSM-8K verification | 84.0% | 89.3% | +5.3 pts |
| Long-chain math (10-step) | 62.5% | 70.1% | +7.6 pts |
  • Token efficiency: Pessimistic verification achieved higher accuracy per token than a 2× longer CoT prompt, meaning the same compute budget yields better verification.
  • Error source analysis: Manual inspection of mismatches showed that ~60% of “false negatives” were actually annotation errors (e.g., missing steps, ambiguous wording) in the test sets.
  • Scalability: Adding more verifier instances yields diminishing returns after 3–4 parallel checks, keeping the method computationally modest.
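
A toy independence model (our assumption, not the paper's analysis) suggests why returns diminish: if each pass catches a real flaw with probability p, k passes catch it with probability 1 − (1 − p)^k, which saturates after a few passes while false rejections keep accumulating. The rates below are illustrative:

```python
def detect_rate(p_catch: float, k: int) -> float:
    """P(at least one of k passes flags a genuinely flawed proof)."""
    return 1 - (1 - p_catch) ** k

def false_reject_rate(p_spurious: float, k: int) -> float:
    """P(at least one of k passes wrongly flags a correct proof)."""
    return 1 - (1 - p_spurious) ** k

for k in range(1, 7):
    print(k, f"{detect_rate(0.6, k):.3f}", f"{false_reject_rate(0.05, k):.3f}")
# k=1: 0.600 detect / 0.050 false-reject; k=3: 0.936 / 0.143; k=6: 0.996 / 0.265.
# Detection saturates while false rejections keep growing, which is why 3-4
# passes is a sensible operating point.
```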

Practical Implications

  • Robust AI assistants: Developers building tutoring bots or automated proof assistants can plug in pessimistic verification to catch subtle mistakes before presenting answers to users (see the usage sketch after this list).
  • Safety‑critical pipelines: In domains like finance or engineering where erroneous calculations can be costly, a cheap “multiple‑eyes” check adds a valuable safety net.
  • Long‑horizon reasoning: For tasks that require many reasoning steps (e.g., symbolic integration, theorem proving), the method catches errors early before they propagate, reducing the need for expensive re‑rollouts.
  • Cost‑effective deployment: Because the technique leverages existing models and only modestly increases token usage, it fits well within API‑based pricing models and on‑device inference constraints.
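
To make the plug-in point above concrete, here is a hypothetical wrapper around `solve_with_verification` from the Methodology sketch; the function name and refusal message are ours:

```python
# Hypothetical assistant integration: only surface answers that survived
# pessimistic verification; otherwise fall back to a safe refusal.
def answer_student(problem: str) -> str:
    proof = solve_with_verification(problem)  # from the earlier sketch
    if proof is None:
        return "I couldn't verify a correct solution; let's work through it together."
    return proof
```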

Limitations & Future Work

  • Verifier diversity: The current experiments mostly use the same model architecture for all parallel checks; exploring heterogeneous verifiers (different model sizes or fine‑tuned checkpoints) could further improve robustness.
  • Latency: Running multiple verifiers in parallel adds wall‑clock time, which may be a bottleneck for real‑time applications unless optimized with batching or asynchronous execution.
  • Dataset quality: The identified annotation errors highlight a broader need for cleaner benchmark data; future work should incorporate noise‑robust evaluation metrics.
  • Beyond math: Extending pessimistic verification to other open‑ended domains (code generation, natural language reasoning) remains an open research direction.

Bottom line: Pessimistic verification offers a pragmatic, low‑overhead way to make LLM‑driven math solvers more trustworthy—an appealing proposition for developers who need reliable AI reasoning without breaking the bank.

Authors

  • Yanxing Huang
  • Zihan Tang
  • Zejin Lin
  • Peng Li
  • Yang Liu

Paper Information

  • arXiv ID: 2511.21522v1
  • Categories: cs.AI
  • Published: November 26, 2025
  • PDF: https://arxiv.org/pdf/2511.21522v1