[Paper] Pessimistic Verification for Open Ended Math Questions
Source: arXiv - 2511.21522v1
Overview
The paper introduces pessimistic verification, a lightweight yet powerful technique for checking the correctness of open‑ended math solutions generated by large language models (LLMs). By running several independent verification passes in parallel and flagging a proof as wrong if any pass spots an error, the authors achieve a noticeable boost in verification accuracy without a hefty increase in compute cost.
Key Contributions
- Pessimistic verification framework: A simple workflow that aggregates multiple parallel verification attempts, treating a single failure as a definitive error signal.
- Empirical gains across benchmarks: Demonstrates consistent improvements on a suite of math verification datasets, often surpassing more compute‑intensive baselines such as extended chain‑of‑thought (CoT) prompting.
- Token‑efficiency analysis: Shows that the method achieves higher verification performance per token, making it attractive for real‑time or resource‑constrained deployments.
- Error‑annotation insight: Reveals that many false negatives reported by stronger models stem from mislabeled ground‑truth data, suggesting that pessimistic verification may be even more effective than reported.
- Scalable self‑verification pipeline: Provides a recipe for integrating pessimistic verification into existing LLM pipelines for long‑horizon mathematical reasoning tasks.
Methodology
- Generate a candidate proof: The base LLM (e.g., GPT‑4, Claude) solves a math problem and outputs a step‑by‑step solution.
- Spawn parallel verifiers: The same or different verifier models are prompted to check the proof. Each verifier runs independently, using a standard “self‑check” prompt (e.g., “Is there any mistake in the reasoning above?”).
- Pessimistic aggregation: If any verifier returns “incorrect” or highlights a flaw, the overall system marks the proof as invalid. Otherwise, it is accepted as correct.
- Optional fallback: When a proof is rejected, the system can trigger a regeneration step or request a more detailed justification from the original solver.
The approach requires no architectural changes to the underlying LLMs; it is purely a prompting and orchestration strategy that can be layered onto existing pipelines.
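To make the orchestration concrete, here is a minimal Python sketch of the generate/verify/aggregate loop. The `call_llm` wrapper, the prompt wording, and the `VERDICT:` output convention are illustrative assumptions of this summary, not the paper's exact prompts or interfaces:

```python
"""Minimal sketch of the pessimistic verification workflow described above.

Assumptions (this summary's, not the paper's): `call_llm` stands in for
whatever chat-completion API you use; the prompt wording and the
"VERDICT:" output convention are illustrative.
"""
from concurrent.futures import ThreadPoolExecutor

VERIFY_PROMPT = (
    "Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
    "Is there any mistake in the reasoning above? Check each step, then "
    "answer on the last line with 'VERDICT: correct' or 'VERDICT: incorrect'."
)


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire up your model provider here")


def verify_once(problem: str, solution: str) -> bool:
    """One independent verification pass; True means 'no error found'."""
    reply = call_llm(VERIFY_PROMPT.format(problem=problem, solution=solution))
    return "VERDICT: correct" in reply


def pessimistic_verify(problem: str, solution: str, n_passes: int = 3) -> bool:
    """Accept only if every parallel pass finds no error (any-fail rejects)."""
    with ThreadPoolExecutor(max_workers=n_passes) as pool:
        results = list(pool.map(lambda _: verify_once(problem, solution),
                                range(n_passes)))
    return all(results)


def solve_with_verification(problem: str, max_attempts: int = 2) -> str | None:
    """Optional fallback: regenerate the solution when a proof is rejected."""
    for _ in range(max_attempts):
        solution = call_llm(f"Solve the following step by step:\n{problem}")
        if pessimistic_verify(problem, solution):
            return solution
    return None  # no attempt survived verification
```

A single flagged error rejects the proof, which is exactly the any‑fail rule described in the aggregation step above.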
Results & Findings
| Benchmark | Single-pass verifier accuracy | Pessimistic verification accuracy (3 passes) | Gain (pts) |
|---|---|---|---|
| MATH (OpenAI) | 71.2% | 78.5% | +7.3 |
| GSM‑8K verification | 84.0% | 89.3% | +5.3 |
| Long‑chain math (10‑step) | 62.5% | 70.1% | +7.6 |
- Token efficiency: Pessimistic verification achieved higher accuracy per token than a 2× longer CoT prompt, meaning the same compute budget yields better verification.
- Error source analysis: Manual inspection of mismatches showed that ~60% of “false negatives” were actually annotation errors (e.g., missing steps, ambiguous wording) in the test sets.
- Scalability: Adding more verifier instances yields diminishing returns after 3–4 parallel checks, keeping the method computationally modest.
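The diminishing returns are consistent with a simple back‑of‑the‑envelope model (an assumption of this summary, not a derivation from the paper): if passes were independent, with per‑pass flaw‑catch probability p and per‑pass false‑alarm rate q, the any‑fail rule gives

```latex
% Assumption (this summary's, not the paper's): passes are independent.
% p = per-pass probability of catching a real flaw;
% q = per-pass false-alarm rate on a correct proof; k = number of passes.
\[
  P_{\text{catch}}(k) = 1 - (1 - p)^{k}, \qquad
  P_{\text{false reject}}(k) = 1 - (1 - q)^{k}.
\]
% Example: p = 0.6, q = 0.05, k = 3 gives
%   P_catch = 1 - 0.4^3  = 0.936
%   P_false = 1 - 0.95^3 ≈ 0.143
```

The catch rate saturates geometrically in k while false rejections of correct proofs keep accumulating, which is why a small number of passes (3–4) hits the sweet spot.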
Practical Implications
- Robust AI assistants: Developers building tutoring bots or automated proof assistants can plug in pessimistic verification to catch subtle mistakes before presenting answers to users.
- Safety‑critical pipelines: In domains like finance or engineering where erroneous calculations can be costly, a cheap “multiple‑eyes” check adds a valuable safety net.
- Long‑horizon reasoning: For tasks that require many reasoning steps (e.g., symbolic integration, theorem proving), the method helps prevent error propagation early, reducing the need for expensive re‑rollouts.
- Cost‑effective deployment: Because the technique leverages existing models and only modestly increases token usage, it fits well within API‑based pricing models and on‑device inference constraints.
Limitations & Future Work
- Verifier diversity: The current experiments mostly use the same model architecture for all parallel checks; exploring heterogeneous verifiers (different model sizes or fine‑tuned checkpoints) could further improve robustness.
- Latency: Running multiple verifiers in parallel adds wall‑clock time, which may be a bottleneck for real‑time applications unless optimized with batching or asynchronous, early‑exit execution (see the sketch after this list).
- Dataset quality: The identified annotation errors highlight a broader need for cleaner benchmark data; future work should incorporate noise‑robust evaluation metrics.
- Beyond math: Extending pessimistic verification to other open‑ended domains (code generation, natural language reasoning) remains an open research direction.
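On the latency point, one mitigation is to short‑circuit: since a single “incorrect” verdict is decisive under pessimistic aggregation, the orchestrator can reject as soon as the first failure arrives and cancel the remaining passes. A minimal asyncio sketch, again assuming a hypothetical `call_llm_async` client:

```python
"""Early-exit variant: reject as soon as any pass flags an error.

Assumption: `call_llm_async` is a hypothetical async client for your
model provider; only the orchestration pattern is the point here.
"""
import asyncio


async def call_llm_async(prompt: str) -> str:
    raise NotImplementedError("wire up your async model client here")


async def verify_once_async(problem: str, solution: str) -> bool:
    reply = await call_llm_async(
        f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
        "Is there any mistake? End with 'VERDICT: correct' or 'VERDICT: incorrect'."
    )
    return "VERDICT: correct" in reply


async def pessimistic_verify_async(problem: str, solution: str, n: int = 3) -> bool:
    tasks = [asyncio.ensure_future(verify_once_async(problem, solution))
             for _ in range(n)]
    try:
        for next_done in asyncio.as_completed(tasks):
            if not await next_done:   # first failure decides: reject early
                return False
        return True                   # every pass found no error
    finally:
        for t in tasks:               # stop any still-running passes
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```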
Bottom line: Pessimistic verification offers a pragmatic, low‑overhead way to make LLM‑driven math solvers more trustworthy—an appealing proposition for developers who need reliable AI reasoning without breaking the bank.
Authors
- Yanxing Huang
- Zihan Tang
- Zejin Lin
- Peng Li
- Yang Liu
Paper Information
- arXiv ID: 2511.21522v1
- Categories: cs.AI
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21522v1