[Paper] MathDuels: Evaluating LLMs as Problem Posers and Solvers
Source: arXiv - 2604.21916v1
Overview
The paper MathDuels proposes a fresh way to benchmark large language models (LLMs) on mathematics: instead of only testing them as solvers on a static set of problems, the models also act as problem creators. By pitting models against each other in a self‑play “duel,” the authors can continuously raise the difficulty of the test set and expose strengths and weaknesses that traditional benchmarks miss.
Key Contributions
- Dual‑role benchmark – Introduces a self‑play framework where every model both generates math problems and attempts to solve those generated by every other model.
- Three‑stage problem‑generation pipeline – Combines meta‑prompting, problem generation, and difficulty amplification to produce well‑posed, challenging questions.
- Independent verification step – An automated verifier filters out ambiguous or ill‑specified problems, ensuring only valid items enter the evaluation.
- Rasch‑model based scoring – Uses a psychometric Rasch model to jointly estimate solver ability, problem difficulty, and author quality from the same interaction data.
- Empirical study on 19 frontier models – Shows that problem‑authoring skill and solving skill are only partially correlated, revealing hidden capability gaps.
- Live, evolving leaderboard – Publishes a public leaderboard that updates automatically as new models are added, keeping the benchmark from hitting a static ceiling.
Methodology
- Meta‑prompting – The model receives a high‑level instruction (e.g., “Create a challenging algebra problem for a peer model”). This primes the model to think like a problem setter.
- Problem Generation – The model writes a full problem statement, including any necessary definitions or constraints.
- Difficulty Amplification – A second prompt nudges the model to increase the problem’s complexity (e.g., “Add an extra variable or tighten the bound”).
- Verification – An independent verifier (a separate LLM plus rule‑based checks) runs the generated problem through a solver to confirm it is well‑posed and has a unique answer. Invalid items are discarded.
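The rule‑based side of this verification step can be illustrated with a short sketch. Here `solve` stands in for a call to a solver LLM, and the specific checks (a non‑empty statement that asks a question, plus agreement across independent solve attempts) are illustrative assumptions, not the paper’s actual verifier.

```python
# Minimal verification sketch: accept a problem only if its statement
# passes basic well-posedness checks and repeated solve attempts agree.
# `solve` is a stand-in for an LLM call; the real verifier is richer.
from collections import Counter
from typing import Callable

def verify_problem(problem: str,
                   solve: Callable[[str], str],
                   n_attempts: int = 3) -> bool:
    """Return True if the problem looks well-posed and independent
    solve attempts converge on a single, unique answer."""
    if not problem.strip() or "?" not in problem:
        return False                    # ill-specified statement
    answers = [solve(problem).strip() for _ in range(n_attempts)]
    if any(not a for a in answers):
        return False                    # solver gave up at least once
    (top, count), = Counter(answers).most_common(1)
    return count == n_attempts          # require a stable, unique answer

# Toy usage with deterministic stand-in solvers:
assert verify_problem("What is 2 + 2?", lambda p: "4")
assert not verify_problem("Compute the thing.", lambda p: "4")
```

A disagreement across attempts is treated as evidence that the problem is ambiguous or has no unique answer, which is the core filtering idea described above.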
- Self‑play solving – Every model attempts to solve every problem authored by every other model, producing a matrix of solver‑vs‑author interactions.
- Rasch analysis – The interaction matrix feeds into a Rasch model, which simultaneously estimates:
  - Solver ability – How likely a model is to solve a problem of a given difficulty.
  - Problem difficulty – The intrinsic challenge of each generated problem.
  - Author quality – Derived from the average difficulty of the problems a model creates.
The whole pipeline is fully automated, allowing new models to be dropped in without manual curation.
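The Rasch step can be sketched in a few lines of NumPy. In the Rasch model, the probability that solver i solves problem j is sigmoid(theta_i − b_j), where theta is solver ability and b is problem difficulty. The gradient‑ascent fit and the toy outcome matrix below are illustrative assumptions, not the paper’s implementation.

```python
# Rasch-model sketch: jointly estimate solver abilities (theta) and
# problem difficulties (b) from a solvers-by-problems 0/1 outcome matrix
# by gradient ascent on the Bernoulli log-likelihood.
import numpy as np

def fit_rasch(outcomes, n_iter=500, lr=0.1):
    """P(solver i solves problem j) = sigmoid(theta_i - b_j)."""
    n_solvers, n_problems = outcomes.shape
    theta = np.zeros(n_solvers)   # solver ability
    b = np.zeros(n_problems)      # problem difficulty
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = outcomes - p      # gradient of the log-likelihood
        theta += lr * resid.sum(axis=1) / n_problems
        b -= lr * resid.sum(axis=0) / n_solvers
        b -= b.mean()             # pin the scale's origin (identifiability)
    return theta, b

# Toy interaction matrix: solver 0 solves the most, problem 3 is hardest.
data = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
theta_hat, b_hat = fit_rasch(data)
```

Author quality then follows directly, as described above: average the fitted difficulties `b_hat` over the problems each model authored.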
Results & Findings
- Partial decoupling of skills – Some models that excel at solving (e.g., GPT‑4‑Turbo) generate relatively easy problems, while others (e.g., Claude‑2) produce tougher questions despite modest solving scores.
- Dynamic difficulty curve – As newer, stronger models join, they author problems that defeat previously top‑ranking solvers, preventing the benchmark from saturating.
- Capability gaps uncovered – Several models score near the ceiling on traditional static benchmarks, yet MathDuels shows they can still be beaten on adversarially generated problems.
- Leaderboard dynamics – The public leaderboard shows a “chasing” pattern: a new model spikes in author quality, then existing solvers improve their scores after the community fine‑tunes prompting strategies.
Practical Implications
- More realistic stress testing – Developers can use MathDuels to gauge how an LLM will behave when faced with user‑generated, potentially adversarial math queries, a scenario common in tutoring apps or code assistants.
- Prompt‑engineering insights – The difficulty‑amplification stage highlights prompt patterns that push models toward harder reasoning, offering a recipe for building tougher evaluation suites.
- Model selection for downstream products – Companies can prioritize models that not only solve but also generate high‑quality problems, useful for automated content creation (e.g., generating practice worksheets).
- Continuous benchmarking pipeline – Because the benchmark evolves with each new model release, it can serve as a “living” test harness integrated into CI pipelines for AI products, ensuring regressions are caught early.
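A CI integration of the kind described in the last bullet could be as simple as a regression gate over the benchmark’s ability estimates. The report file name and JSON schema below are hypothetical, made up purely for illustration.

```python
# Hypothetical CI regression gate: fail the build if a model's estimated
# solver ability dropped too far since the last benchmark run.
import json
import os
import tempfile

def check_regression(report_path: str, tolerance: float = 0.05) -> bool:
    """Return True if the ability drop is within the allowed tolerance."""
    with open(report_path) as f:
        report = json.load(f)
    drop = report["previous_ability"] - report["current_ability"]
    return drop <= tolerance

# Toy usage with a temporary report file (schema is an assumption):
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"previous_ability": 1.20, "current_ability": 1.18}, f)
    path = f.name
passed = check_regression(path)   # a 0.02 drop is within tolerance
os.remove(path)
```

In a real pipeline the report would be produced by the self‑play run itself, and the gate’s exit status would mark the build red or green.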
Limitations & Future Work
- Verifier reliance – The current verification step depends on another LLM, which may occasionally misclassify borderline problems; a more formal theorem‑proving backend could improve robustness.
- Scope of math domains – The study focuses mainly on algebra and calculus; extending to combinatorics, number theory, or applied math (e.g., physics‑style problems) would broaden applicability.
- Rasch model assumptions – The Rasch model presumes unidimensional ability, which may oversimplify the multi‑faceted nature of mathematical reasoning (e.g., symbolic manipulation vs. logical deduction).
- Human‑in‑the‑loop validation – Future work could incorporate expert human review to calibrate difficulty scores and catch subtle ambiguities that automated verifiers miss.
MathDuels opens the door to a more dynamic, adversarial, and informative way of measuring LLM capabilities—an approach that could become a staple in the toolbox of AI developers and product teams alike.
Authors
- Zhiqiu Xu
- Shibo Jin
- Shreya Arya
- Mayur Naik
Paper Information
- arXiv ID: 2604.21916v1
- Categories: cs.CL, cs.SE
- Published: April 23, 2026