[Paper] Pessimistic Verification for Open Ended Math Questions
Source: arXiv - 2511.21522v1
Overview
The paper introduces pessimistic verification, a lightweight yet powerful technique for checking the correctness of open‑ended math solutions generated by large language models (LLMs). By running several independent verification passes in parallel and flagging a proof as wrong if any pass spots an error, the authors achieve a noticeable boost in verification accuracy without a hefty increase in compute cost.
Key Contributions
- Pessimistic verification framework: A simple workflow that aggregates multiple parallel verification attempts, treating a single failure as a definitive error signal.
- Empirical gains across benchmarks: Demonstrates consistent improvements on a suite of math verification datasets, often surpassing more compute‑intensive baselines such as extended chain‑of‑thought (CoT) prompting.
- Token‑efficiency analysis: Shows that the method achieves higher verification performance per token, making it attractive for real‑time or resource‑constrained deployments.
- Error‑annotation insight: Reveals that many false negatives reported by stronger models stem from mislabeled ground‑truth data, suggesting that pessimistic verification may be even more effective than reported.
- Scalable self‑verification pipeline: Provides a recipe for integrating pessimistic verification into existing LLM pipelines for long‑horizon mathematical reasoning tasks.
Methodology
- Generate a candidate proof: The base LLM (e.g., GPT‑4, Claude) solves a math problem and outputs a step‑by‑step solution.
- Spawn parallel verifiers: The same or different verifier models are prompted to check the proof. Each verifier runs independently, using a standard “self‑check” prompt (e.g., “Is there any mistake in the reasoning above?”).
- Pessimistic aggregation: If any verifier returns “incorrect” or highlights a flaw, the overall system marks the proof as invalid. Otherwise, it is accepted as correct.
- Optional fallback: When a proof is rejected, the system can trigger a regeneration step or request a more detailed justification from the original solver.
The approach requires no architectural changes to the underlying LLMs; it is purely a prompting and orchestration strategy that can be layered onto existing pipelines.
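To make the orchestration concrete, here is a minimal Python sketch of the generate/verify/aggregate loop. The `call_llm` wrapper, the prompt wording, and the `VERDICT:` output convention are illustrative assumptions of this summary, not the paper's exact prompts or interfaces:

```python
"""Minimal sketch of the pessimistic verification workflow described above.

Assumptions (this summary's, not the paper's): `call_llm` stands in for
whatever chat-completion API you use; the prompt wording and the
"VERDICT:" output convention are illustrative.
"""
from concurrent.futures import ThreadPoolExecutor

VERIFY_PROMPT = (
    "Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
    "Is there any mistake in the reasoning above? Check each step, then "
    "answer on the last line with 'VERDICT: correct' or 'VERDICT: incorrect'."
)


def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire up your model provider here")


def verify_once(problem: str, solution: str) -> bool:
    """One independent verification pass; True means 'no error found'."""
    reply = call_llm(VERIFY_PROMPT.format(problem=problem, solution=solution))
    return "VERDICT: correct" in reply


def pessimistic_verify(problem: str, solution: str, n_passes: int = 3) -> bool:
    """Accept only if every parallel pass finds no error (any-fail rejects)."""
    with ThreadPoolExecutor(max_workers=n_passes) as pool:
        results = list(pool.map(lambda _: verify_once(problem, solution),
                                range(n_passes)))
    return all(results)


def solve_with_verification(problem: str, max_attempts: int = 2) -> str | None:
    """Optional fallback: regenerate the solution when a proof is rejected."""
    for _ in range(max_attempts):
        solution = call_llm(f"Solve the following step by step:\n{problem}")
        if pessimistic_verify(problem, solution):
            return solution
    return None  # no attempt survived verification
```

A single flagged error rejects the proof, which is exactly the any‑fail rule described in the aggregation step above.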
Results & Findings
| Benchmark | Single-pass verifier accuracy | Pessimistic verification accuracy (3 passes) | Gain (pts) |
|---|---|---|---|
| MATH (OpenAI) | 71.2% | 78.5% | +7.3 |
| GSM‑8K verification | 84.0% | 89.3% | +5.3 |
| Long‑chain math (10‑step) | 62.5% | 70.1% | +7.6 |
- Token efficiency: Pessimistic verification achieved higher accuracy per token than a 2× longer CoT prompt, meaning the same compute budget yields better verification.
- Error source analysis: Manual inspection of mismatches showed that ~60% of “false negatives” were actually annotation errors (e.g., missing steps, ambiguous wording) in the test sets.
- Scalability: Adding more verifier instances yields diminishing returns after 3–4 parallel checks, keeping the method computationally modest.
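The diminishing returns are consistent with a simple back‑of‑the‑envelope model (an assumption of this summary, not a derivation from the paper): if passes were independent, with per‑pass flaw‑catch probability p and per‑pass false‑alarm rate q, the any‑fail rule gives

```latex
% Assumption (this summary's, not the paper's): passes are independent.
% p = per-pass probability of catching a real flaw;
% q = per-pass false-alarm rate on a correct proof; k = number of passes.
\[
  P_{\text{catch}}(k) = 1 - (1 - p)^{k}, \qquad
  P_{\text{false reject}}(k) = 1 - (1 - q)^{k}.
\]
% Example: p = 0.6, q = 0.05, k = 3 gives
%   P_catch = 1 - 0.4^3  = 0.936
%   P_false = 1 - 0.95^3 ≈ 0.143
```

The catch rate saturates geometrically in k while false rejections of correct proofs keep accumulating, which is why a small number of passes (3–4) hits the sweet spot.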
Practical Implications
- Robust AI assistants: Developers building tutoring bots or automated proof assistants can plug in pessimistic verification to catch subtle mistakes before presenting answers to users.
- Safety‑critical pipelines: In domains like finance or engineering where erroneous calculations can be costly, a cheap “multiple‑eyes” check adds a valuable safety net.
- Long‑horizon reasoning: For tasks that require many reasoning steps (e.g., symbolic integration, theorem proving), the method helps prevent error propagation early, reducing the need for expensive re‑rollouts.
- Cost‑effective deployment: Because the technique leverages existing models and only modestly increases token usage, it fits well within API‑based pricing models and on‑device inference constraints.
Limitations & Future Work
- Verifier diversity: The current experiments mostly use the same model architecture for all parallel checks; exploring heterogeneous verifiers (different model sizes or fine‑tuned checkpoints) could further improve robustness.
- Latency: Running multiple verifiers in parallel adds wall‑clock time, which may be a bottleneck for real‑time applications unless optimized with batching or asynchronous, early‑exit execution (see the sketch after this list).
- Dataset quality: The identified annotation errors highlight a broader need for cleaner benchmark data; future work should incorporate noise‑robust evaluation metrics.
- Beyond math: Extending pessimistic verification to other open‑ended domains (code generation, natural language reasoning) remains an open research direction.
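On the latency point, one mitigation is to short‑circuit: since a single “incorrect” verdict is decisive under pessimistic aggregation, the orchestrator can reject as soon as the first failure arrives and cancel the remaining passes. A minimal asyncio sketch, again assuming a hypothetical `call_llm_async` client:

```python
"""Early-exit variant: reject as soon as any pass flags an error.

Assumption: `call_llm_async` is a hypothetical async client for your
model provider; only the orchestration pattern is the point here.
"""
import asyncio


async def call_llm_async(prompt: str) -> str:
    raise NotImplementedError("wire up your async model client here")


async def verify_once_async(problem: str, solution: str) -> bool:
    reply = await call_llm_async(
        f"Problem:\n{problem}\n\nCandidate solution:\n{solution}\n\n"
        "Is there any mistake? End with 'VERDICT: correct' or 'VERDICT: incorrect'."
    )
    return "VERDICT: correct" in reply


async def pessimistic_verify_async(problem: str, solution: str, n: int = 3) -> bool:
    tasks = [asyncio.ensure_future(verify_once_async(problem, solution))
             for _ in range(n)]
    try:
        for next_done in asyncio.as_completed(tasks):
            if not await next_done:   # first failure decides: reject early
                return False
        return True                   # every pass found no error
    finally:
        for t in tasks:               # stop any still-running passes
            t.cancel()
        await asyncio.gather(*tasks, return_exceptions=True)
```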
Bottom line: Pessimistic verification offers a pragmatic, low‑overhead way to make LLM‑driven math solvers more trustworthy—an appealing proposition for developers who need reliable AI reasoning without breaking the bank.
Authors
- Yanxing Huang
- Zihan Tang
- Zejin Lin
- Peng Li
- Yang Liu
Paper Information
- arXiv ID: 2511.21522v1
- Categories: cs.AI
- Published: November 26, 2025
- PDF: https://arxiv.org/pdf/2511.21522v1