[Paper] Verifier-Backed Hard Problem Generation for Mathematical Reasoning

Published: May 7, 2026 at 01:58 PM EDT
Source: arXiv

Overview

The paper presents VHG (Verifier‑Backed Hard Problem Generation), a framework that lets language models automatically create challenging, valid mathematical problems without heavy human oversight. By adding an independent verifier to the classic setter‑solver loop, VHG curbs reward hacking by the setter and produces higher‑quality training data for future LLMs.

Key Contributions

  • Three‑party self‑play architecture: Introduces a verifier alongside the problem setter and solver, turning the reward signal into a joint assessment of validity and difficulty.
  • Two verifier implementations:
    1. Hard symbolic verifier – a rule‑based engine that checks mathematical correctness using symbolic computation.
    2. Soft LLM‑based verifier – a smaller, fine‑tuned language model that judges plausibility when symbolic checking is infeasible.
  • Empirical validation on two fronts: (a) indefinite integral generation and (b) broader mathematical reasoning tasks, showing consistent gains over prior self‑play and human‑in‑the‑loop baselines.
  • Open‑source toolkit: The authors release code and pretrained components, enabling other teams to plug VHG into their own problem‑generation pipelines.
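
To make the hard-verifier idea concrete for the indefinite-integral setting, here is a minimal sketch of a validity check: differentiate a candidate antiderivative and compare it with the integrand. This is an illustration, not the authors' implementation. It swaps the symbolic (CAS) check for a numerical spot-check with central differences so that it runs on the Python standard library alone, and the function name `looks_like_antiderivative`, the sampling interval, and the tolerances are all assumptions.

```python
import math
import random
from typing import Callable


def looks_like_antiderivative(
    integrand: Callable[[float], float],
    candidate: Callable[[float], float],
    lo: float = 0.5,
    hi: float = 2.0,
    samples: int = 50,
    h: float = 1e-5,
    tol: float = 1e-5,
    seed: int = 0,
) -> bool:
    """Return True if d/dx candidate(x) matches integrand(x) at random points.

    Numerical stand-in for a symbolic check; all defaults are hypothetical.
    """
    rng = random.Random(seed)
    for _ in range(samples):
        x = rng.uniform(lo, hi)
        # Central difference approximates the derivative of the candidate.
        slope = (candidate(x + h) - candidate(x - h)) / (2 * h)
        if not math.isclose(slope, integrand(x), rel_tol=tol, abs_tol=tol):
            return False
    return True


# Problem: integrate x*cos(x). A correct answer is x*sin(x) + cos(x) + C.
good = looks_like_antiderivative(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x) + math.cos(x),
)
# A wrong answer that drops the cos(x) term should be rejected.
bad = looks_like_antiderivative(
    lambda x: x * math.cos(x),
    lambda x: x * math.sin(x),
)
print(good, bad)
```

A spot-check like this can only reject candidates at the sampled points, which is one reason a symbolic engine is the stronger choice when the problem falls within its scope.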

Methodology

  1. Problem Setter (Generator) – an LLM prompted to produce a new math problem.

  2. Solver (Evaluator) – another LLM tasked with solving the generated problem; its success rate serves as a proxy for difficulty (harder problems → lower solve rate).

  3. Verifier (Validator) – runs in parallel:

    • The hard verifier parses the problem and uses a CAS (Computer Algebra System) to confirm that the statement is mathematically sound and that a unique solution exists.
    • The soft verifier scores the problem’s logical coherence and novelty using a lightweight LLM trained on a curated set of valid/invalid examples.

  4. Reward shaping – the setter receives a composite reward:

    Reward = α * ValidityScore (verifier) + β * DifficultyScore (solver)

    The setter is thus incentivized to produce both correct and non‑trivial problems.

  5. Training loop – the setter is fine‑tuned via reinforcement learning (PPO) using the composite reward, while the solver and verifier are kept fixed (or optionally co‑trained in later stages).
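
The composite reward in step 4 can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's code: the post only says that solver success rate is a proxy for difficulty, so mapping it to `DifficultyScore = 1 - solve rate` is an assumption, as are the names `RewardWeights`, `difficulty_from_solves`, `composite_reward`, and the default weights of 0.5.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class RewardWeights:
    alpha: float = 0.5  # weight on validity (hypothetical value)
    beta: float = 0.5   # weight on difficulty (hypothetical value)


def difficulty_from_solves(solved: Sequence[bool]) -> float:
    """Assumed mapping: difficulty = 1 - fraction of solver attempts that succeed."""
    if not solved:
        raise ValueError("need at least one solver attempt")
    return 1.0 - sum(solved) / len(solved)


def composite_reward(
    validity: float,
    difficulty: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Reward = alpha * ValidityScore + beta * DifficultyScore."""
    return weights.alpha * validity + weights.beta * difficulty


# A valid problem that the solver cracks in 1 of 4 attempts.
difficulty = difficulty_from_solves([True, False, False, False])
reward = composite_reward(validity=1.0, difficulty=difficulty)
print(difficulty, reward)  # 0.75 0.875
```

Note that with a purely additive form, an invalid problem that no solver attempt cracks still earns the full β term, so the choice of α relative to β matters. The post does not say how the authors set these weights.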

Results & Findings

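| Task (metric)                      | Baseline (self‑play) | VHG (hard verifier) | VHG (soft verifier) |
|------------------------------------|----------------------|---------------------|---------------------|
| Indefinite integrals (validity %)  | 68%                  | 92%                 | 88%                 |
| Solver success rate (difficulty)   | 45%                  | 30%                 | 33%                 |
| General math reasoning (BLEU‑like) | 0.61                 | 0.78                | 0.75                |
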
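  • Validity boost: Adding the verifier raises validity on indefinite integrals from 68% to 92% (hard verifier) or 88% (soft verifier), a gain of 20 to 24 percentage points.
  • Harder problems: Solver success drops from 45% to 30% (hard verifier) or 33% (soft verifier), indicating the setter learns to push the difficulty envelope while staying correct.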
  • Robustness: The soft LLM verifier, though less precise than the symbolic one, still yields substantial improvements and works on problem types where symbolic checking fails (e.g., combinatorial proofs).

Practical Implications

  • Automated curriculum generation – educational platforms can continuously synthesize fresh, vetted exercises for students or for training downstream LLMs.
  • Self‑improving research assistants – an LLM equipped with VHG can propose novel conjectures or test cases, then verify them before feeding them back into its own training loop, reducing reliance on human mathematicians.
  • Benchmark enrichment – test suites for math‑oriented LLMs (e.g., MATH, GSM8K) can be expanded automatically, keeping benchmarks from becoming stale.
  • Developer tooling – the released toolkit lets engineers plug a verifier into any generation pipeline (code generation, data augmentation, prompt engineering), improving the safety and reliability of AI‑generated content.

Limitations & Future Work

  • Verifier dependence: The hard symbolic verifier struggles with problems outside the scope of current CAS libraries (e.g., advanced topology), limiting coverage.
  • Soft verifier bias: Because the soft verifier is itself an LLM, it can inherit the same hallucination patterns it is meant to catch, so it requires careful calibration.
  • Scalability of RL: Reinforcement learning on large LLMs remains compute‑intensive; the authors note that lighter fine‑tuning strategies could make VHG more accessible.
  • Future directions: Extending the framework to multi‑modal reasoning (e.g., geometry with diagrams), exploring co‑training of solver and verifier, and integrating human‑in‑the‑loop feedback for rare edge cases.

Authors

  • Yuhang Lai
  • Jiazhan Feng
  • Yee Whye Teh
  • Ning Miao

Paper Information

  • arXiv ID: 2605.06660v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: May 7, 2026