[Paper] MathDuels: Evaluating LLMs as Problem Posers and Solvers

Published: April 23, 2026 at 01:57 PM EDT
5 min read
Source: arXiv - 2604.21916v1

Overview

The paper MathDuels proposes a fresh way to benchmark large language models (LLMs) on mathematics: instead of only testing them as solvers on a static set of problems, the models also act as problem creators. By pitting models against each other in a self‑play “duel,” the authors can continuously raise the difficulty of the test set and expose strengths that traditional benchmarks miss.

Key Contributions

  • Dual‑role benchmark – Introduces a self‑play framework where every model both generates math problems and attempts to solve those generated by every other model.
  • Three‑stage problem‑generation pipeline – Combines meta‑prompting, problem generation, and difficulty amplification to produce well‑posed, challenging questions.
  • Independent verification step – An automated verifier filters out ambiguous or ill‑specified problems, ensuring only valid items enter the evaluation.
  • Rasch‑model based scoring – Uses a psychometric Rasch model to jointly estimate solver ability, problem difficulty, and author quality from the same interaction data.
  • Empirical study on 19 frontier models – Shows that problem‑authoring skill and solving skill are only partially correlated, revealing hidden capability gaps.
  • Live, evolving leaderboard – Publishes a public leaderboard that updates automatically as new models are added, keeping the benchmark from hitting a static ceiling.

Methodology

  1. Meta‑prompting – The model receives a high‑level instruction (e.g., “Create a challenging algebra problem for a peer model”). This primes the model to think like a problem setter.
  2. Problem Generation – The model writes a full problem statement, including any necessary definitions or constraints.
  3. Difficulty Amplification – A second prompt nudges the model to increase the problem’s complexity (e.g., “Add an extra variable or tighten the bound”).
  4. Verification – An independent verifier (a separate LLM plus rule‑based checks) runs the generated problem through a solver to confirm it is well‑posed and has a unique answer. Invalid items are discarded.
  5. Self‑play solving – Every model attempts to solve every problem authored by every other model, producing a matrix of solver‑vs‑author interactions.
  6. Rasch analysis – The interaction matrix feeds into a Rasch model, which simultaneously estimates:
    • Solver ability – How likely a model is to solve a problem of a given difficulty.
    • Problem difficulty – The intrinsic challenge of each generated problem.
    • Author quality – Derived from the average difficulty of the problems a model creates.
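The Rasch analysis in step 6 can be illustrated with a minimal sketch. Under the classic Rasch model, the probability that solver *i* answers problem *j* correctly is sigmoid(θᵢ − bⱼ), where θ is solver ability and b is problem difficulty. The code below is an illustrative gradient-ascent fit on a binary outcome matrix, not the authors' implementation; all names (`fit_rasch`, `author_quality`) are assumptions for the sketch:

```python
import numpy as np

def fit_rasch(outcomes, n_iters=500, lr=0.1):
    """Jointly estimate solver abilities (theta) and problem
    difficulties (b) via gradient ascent on the Rasch
    log-likelihood: P(solver i solves problem j) = sigmoid(theta_i - b_j).
    outcomes[i, j] is 1 if solver i solved problem j, 0 if not,
    and NaN for unobserved pairs (e.g. a model's own problems)."""
    n_solvers, n_problems = outcomes.shape
    theta = np.zeros(n_solvers)   # solver abilities
    b = np.zeros(n_problems)      # problem difficulties
    mask = ~np.isnan(outcomes)
    y = np.nan_to_num(outcomes)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = np.where(mask, y - p, 0.0)  # likelihood gradient terms
        theta += lr * resid.sum(axis=1)     # ascend in theta
        b -= lr * resid.sum(axis=0)         # ascend in b (opposite sign)
        b -= b.mean()                       # fix the scale: center difficulties
    return theta, b

def author_quality(b, author_of):
    """Author quality as the mean difficulty of a model's problems;
    author_of[j] is the index of the model that wrote problem j."""
    authors = np.asarray(author_of)
    return np.array([b[authors == a].mean()
                     for a in range(authors.max() + 1)])
```

On simulated data, the recovered orderings of θ and b match the observed solve rates, which is the property the benchmark's leaderboard relies on.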

The whole pipeline is fully automated, allowing new models to be dropped in without manual curation.

Results & Findings

  • Partial decoupling of skills – Some models that excel at solving (e.g., GPT‑4‑Turbo) generate relatively easy problems, while others (e.g., Claude‑2) produce tougher questions despite modest solving scores.
  • Dynamic difficulty curve – As newer, stronger models join, they author problems that defeat previously top‑ranking solvers, preventing the benchmark from saturating.
  • Capability gaps uncovered – Several models that traditional static benchmarks rated near the ceiling still failed on adversarially generated problems, exposing gaps that static test sets cannot detect.
  • Leaderboard dynamics – The public leaderboard shows a “chasing” pattern: a new model spikes in author quality, then existing solvers improve their scores after the community fine‑tunes prompting strategies.

Practical Implications

  • More realistic stress testing – Developers can use MathDuels to gauge how an LLM behaves when faced with user‑generated, potentially adversarial math queries, a scenario common in tutoring apps and code assistants.
  • Prompt‑engineering insights – The difficulty‑amplification stage highlights prompt patterns that push models toward harder reasoning, offering a recipe for building tougher evaluation suites.
  • Model selection for downstream products – Companies can prioritize models that not only solve but also generate high‑quality problems, useful for automated content creation (e.g., generating practice worksheets).
  • Continuous benchmarking pipeline – Because the benchmark evolves with each new model release, it can serve as a “living” test harness integrated into CI pipelines for AI products, ensuring regressions are caught early.

Limitations & Future Work

  • Verifier reliance – The current verification step depends on another LLM, which may occasionally misclassify borderline problems; a more formal theorem‑proving backend could improve robustness.
  • Scope of math domains – The study focuses mainly on algebra and calculus; extending to combinatorics, number theory, or applied math (e.g., physics‑style problems) would broaden applicability.
  • Rasch model assumptions – The Rasch model presumes unidimensional ability, which may oversimplify the multi‑faceted nature of mathematical reasoning (e.g., symbolic manipulation vs. logical deduction).
  • Human‑in‑the‑loop validation – Future work could incorporate expert human review to calibrate difficulty scores and catch subtle ambiguities that automated verifiers miss.

MathDuels opens the door to a more dynamic, adversarial, and informative way of measuring LLM capabilities—an approach that could become a staple in the toolbox of AI developers and product teams alike.

Authors

  • Zhiqiu Xu
  • Shibo Jin
  • Shreya Arya
  • Mayur Naik

Paper Information

  • arXiv ID: 2604.21916v1
  • Categories: cs.CL, cs.SE
  • Published: April 23, 2026