[Paper] MathDuels: Evaluating LLMs as Problem Posers and Solvers
Source: arXiv - 2604.21916v1
Overview
The paper MathDuels proposes a fresh way to benchmark large language models (LLMs) on mathematics: instead of only testing them as solvers on a static set of problems, the models also act as problem creators. By pitting models against each other in a self‑play “duel,” the authors can continuously raise the difficulty of the test set and expose strengths and weaknesses that traditional benchmarks miss.
Key Contributions
- Dual‑role benchmark – Introduces a self‑play framework where every model both generates math problems and attempts to solve those generated by every other model.
- Three‑stage problem‑generation pipeline – Combines meta‑prompting, problem generation, and difficulty amplification to produce well‑posed, challenging questions.
- Independent verification step – An automated verifier filters out ambiguous or ill‑specified problems, ensuring only valid items enter the evaluation.
- Rasch‑model based scoring – Uses a psychometric Rasch model to jointly estimate solver ability, problem difficulty, and author quality from the same interaction data.
- Empirical study on 19 frontier models – Shows that problem‑authoring skill and solving skill are only partially correlated, revealing hidden capability gaps.
- Live, evolving leaderboard – Publishes a public leaderboard that updates automatically as new models are added, keeping the benchmark from hitting a static ceiling.
Methodology
- Meta‑prompting – The model receives a high‑level instruction (e.g., “Create a challenging algebra problem for a peer model”). This primes the model to think like a problem setter.
- Problem Generation – The model writes a full problem statement, including any necessary definitions or constraints.
- Difficulty Amplification – A second prompt nudges the model to increase the problem’s complexity (e.g., “Add an extra variable or tighten the bound”).
- Verification – An independent verifier (a separate LLM plus rule‑based checks) runs the generated problem through a solver to confirm it is well‑posed and has a unique answer. Invalid items are discarded.
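The rule‑based side of this verification step can be illustrated with a short sketch. Here `solve` stands in for a call to a solver LLM, and the specific checks (a non‑empty statement that asks a question, plus agreement across independent solve attempts) are illustrative assumptions, not the paper’s actual verifier.

```python
# Minimal verification sketch: accept a problem only if its statement
# passes basic well-posedness checks and repeated solve attempts agree.
# `solve` is a stand-in for an LLM call; the real verifier is richer.
from collections import Counter
from typing import Callable

def verify_problem(problem: str,
                   solve: Callable[[str], str],
                   n_attempts: int = 3) -> bool:
    """Return True if the problem looks well-posed and independent
    solve attempts converge on a single, unique answer."""
    if not problem.strip() or "?" not in problem:
        return False                    # ill-specified statement
    answers = [solve(problem).strip() for _ in range(n_attempts)]
    if any(not a for a in answers):
        return False                    # solver gave up at least once
    (top, count), = Counter(answers).most_common(1)
    return count == n_attempts          # require a stable, unique answer

# Toy usage with deterministic stand-in solvers:
assert verify_problem("What is 2 + 2?", lambda p: "4")
assert not verify_problem("Compute the thing.", lambda p: "4")
```

A disagreement across attempts is treated as evidence that the problem is ambiguous or has no unique answer, which is the core filtering idea described above.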
- Self‑play solving – Every model attempts to solve every problem authored by every other model, producing a matrix of solver‑vs‑author interactions.
- Rasch analysis – The interaction matrix feeds into a Rasch model, which simultaneously estimates:
  - Solver ability – How likely a model is to solve a problem of a given difficulty.
  - Problem difficulty – The intrinsic challenge of each generated problem.
  - Author quality – Derived from the average difficulty of the problems a model creates.
The whole pipeline is fully automated, allowing new models to be dropped in without manual curation.
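The Rasch step can be sketched in a few lines of NumPy. In the Rasch model, the probability that solver i solves problem j is sigmoid(theta_i − b_j), where theta is solver ability and b is problem difficulty. The gradient‑ascent fit and the toy outcome matrix below are illustrative assumptions, not the paper’s implementation.

```python
# Rasch-model sketch: jointly estimate solver abilities (theta) and
# problem difficulties (b) from a solvers-by-problems 0/1 outcome matrix
# by gradient ascent on the Bernoulli log-likelihood.
import numpy as np

def fit_rasch(outcomes, n_iter=500, lr=0.1):
    """P(solver i solves problem j) = sigmoid(theta_i - b_j)."""
    n_solvers, n_problems = outcomes.shape
    theta = np.zeros(n_solvers)   # solver ability
    b = np.zeros(n_problems)      # problem difficulty
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
        resid = outcomes - p      # gradient of the log-likelihood
        theta += lr * resid.sum(axis=1) / n_problems
        b -= lr * resid.sum(axis=0) / n_solvers
        b -= b.mean()             # pin the scale's origin (identifiability)
    return theta, b

# Toy interaction matrix: solver 0 solves the most, problem 3 is hardest.
data = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
], dtype=float)
theta_hat, b_hat = fit_rasch(data)
```

Author quality then follows directly, as described above: average the fitted difficulties `b_hat` over the problems each model authored.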
Results & Findings
- Partial decoupling of skills – Some models that excel at solving (e.g., GPT‑4‑Turbo) generate relatively easy problems, while others (e.g., Claude‑2) produce tougher questions despite modest solving scores.
- Dynamic difficulty curve – As newer, stronger models join, they author problems that defeat previously top‑ranking solvers, preventing the benchmark from saturating.
- Capability gaps uncovered – Several models score near the ceiling on traditional static benchmarks, yet MathDuels shows they can still be beaten on adversarially generated problems.
- Leaderboard dynamics – The public leaderboard shows a “chasing” pattern: a new model spikes in author quality, then existing solvers improve their scores after the community fine‑tunes prompting strategies.
Practical Implications
- More realistic stress testing – Developers can use MathDuels to gauge how an LLM will behave when faced with user‑generated, potentially adversarial math queries, a scenario common in tutoring apps or code assistants.
- Prompt‑engineering insights – The difficulty‑amplification stage highlights prompt patterns that push models toward harder reasoning, offering a recipe for building tougher evaluation suites.
- Model selection for downstream products – Companies can prioritize models that not only solve but also generate high‑quality problems, useful for automated content creation (e.g., generating practice worksheets).
- Continuous benchmarking pipeline – Because the benchmark evolves with each new model release, it can serve as a “living” test harness integrated into CI pipelines for AI products, ensuring regressions are caught early.
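A CI integration of the kind described in the last bullet could be as simple as a regression gate over the benchmark’s ability estimates. The report file name and JSON schema below are hypothetical, made up purely for illustration.

```python
# Hypothetical CI regression gate: fail the build if a model's estimated
# solver ability dropped too far since the last benchmark run.
import json
import os
import tempfile

def check_regression(report_path: str, tolerance: float = 0.05) -> bool:
    """Return True if the ability drop is within the allowed tolerance."""
    with open(report_path) as f:
        report = json.load(f)
    drop = report["previous_ability"] - report["current_ability"]
    return drop <= tolerance

# Toy usage with a temporary report file (schema is an assumption):
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"previous_ability": 1.20, "current_ability": 1.18}, f)
    path = f.name
passed = check_regression(path)   # a 0.02 drop is within tolerance
os.remove(path)
```

In a real pipeline the report would be produced by the self‑play run itself, and the gate’s exit status would mark the build red or green.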
Limitations & Future Work
- Verifier reliance – The current verification step depends on another LLM, which may occasionally misclassify borderline problems; a more formal theorem‑proving backend could improve robustness.
- Scope of math domains – The study focuses mainly on algebra and calculus; extending to combinatorics, number theory, or applied math (e.g., physics‑style problems) would broaden applicability.
- Rasch model assumptions – The Rasch model presumes unidimensional ability, which may oversimplify the multi‑faceted nature of mathematical reasoning (e.g., symbolic manipulation vs. logical deduction).
- Human‑in‑the‑loop validation – Future work could incorporate expert human review to calibrate difficulty scores and catch subtle ambiguities that automated verifiers miss.
MathDuels opens the door to a more dynamic, adversarial, and informative way of measuring LLM capabilities—an approach that could become a staple in the toolbox of AI developers and product teams alike.
Authors
- Zhiqiu Xu
- Shibo Jin
- Shreya Arya
- Mayur Naik
Paper Information
- arXiv ID: 2604.21916v1
- Categories: cs.CL, cs.SE
- Published: April 23, 2026