MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone
Source: MIT News - AI
Overview
Every year, the countries competing in the International Mathematical Olympiad (IMO) arrive with a booklet of their best, most original problems. Those booklets get shared among delegations, then quietly disappear. No one had ever collected them systematically, cleaned them, and made them available—not for AI researchers testing the limits of mathematical reasoning, and not for the students around the world training for these competitions largely on their own.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST), and the company HUMAIN have now done exactly that.
What is MathNet?
MathNet is the largest high‑quality dataset of proof‑based math problems ever created.
- Size: > 30,000 expert‑authored problems and solutions
- Coverage: 47 countries, 17 languages, 143 competitions
- Scale: Five times larger than the next‑biggest dataset of its kind
The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
Why MathNet Is Different
What makes MathNet different is not only its size, but its breadth.
- Previous Olympiad‑level datasets draw almost exclusively from competitions in the United States and China.
- MathNet spans dozens of countries across six continents, covers 17 languages, includes both text‑ and image‑based problems and solutions, and stretches across four decades of competition mathematics.
“Every country brings a booklet of its most novel and most creative problems,” says Shaden Alshammari, an MIT PhD student and lead author on the paper. “They share the booklets with each other, but no one had made the effort to collect them, clean them, and upload them online.”
Building the Dataset
- Source material: 1,595 PDF volumes totaling more than 25,000 pages, ranging from digital documents to decades‑old scans.
- Key contributor: Navid Safaei, a longtime IMO community figure and co‑author, who had been collecting and scanning those booklets by hand since 2006. His personal archive formed much of the backbone of the dataset.
Sourcing Matters
Where most existing math datasets pull problems from community forums like Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition booklets.
- Solutions are expert‑written, peer‑reviewed, and often run to multiple pages, with authors walking through several approaches to the same problem.
- This depth gives AI models a far richer signal for learning mathematical reasoning than the shorter, informal solutions typical of community‑sourced datasets.
“I remember so many students for whom it was an individual effort. No one in their country was training them for this kind of competition,” says Alshammari, who competed in the IMO as a student herself. “We hope this gives them a centralized place with high‑quality problems and solutions to learn from.”
Community Involvement
- Team ties: Sultan Albarakati, a co‑author, currently serves on the IMO board.
- Validation: A grading group of more than 30 human evaluators from Armenia, Russia, Ukraine, Vietnam, Poland, and other countries verified thousands of solutions.
“The MathNet database has the potential to be an excellent resource for both students and leaders seeking new problems to work on or looking for the solution to a difficult question,” says Tanish Patil, deputy leader of Switzerland’s IMO team. “Whilst other archives of Olympiad problems do exist (notably, the Contest Collections forums on AoPS), these resources lack a standardized formatting system, verified solutions, and important problem metadata such as topic and required theory. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and whether we will soon be able to reliably answer an important question that arises when creating novel Olympiad problems: determining whether a problem is truly original.”
Benchmarking AI Performance
MathNet also functions as a rigorous benchmark for AI performance, and the results reveal a more complicated picture than recent headlines about AI math prowess might suggest.
- Frontier models: Some have reportedly achieved gold‑medal performance at the IMO and now solve problems that would stump most humans.
- MathNet results: Even GPT‑5, the top‑performing model tested, averaged ≈ 69.3 % on MathNet’s main benchmark of 6,400 problems, failing nearly one in three Olympiad‑level problems.
- Visual reasoning: When problems include figures, performance drops significantly across the board, exposing visual reasoning as a consistent weak point for even the most capable models.
Language Gaps
Several open‑source models scored 0 % on Mongolian‑language problems, highlighting another dimension where current AI systems fall short despite their overall strength.
“GPT models are equally good in English and other languages,” Alshammari says. “But many of the open‑source models fail completely at less‑common languages, such as Mongolian.”
Broader Impact
The diversity of MathNet is also designed to address a deeper limitation in how AI models learn mathematics. When training data skews toward English and Chinese problems, models absorb a narrow slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number‑theory problem can carry styles of reasoning that models trained on a narrower corpus rarely encounter.
MathNet: A New Benchmark for Mathematical Reasoning
Key Insight
The authors argue that exposing both humans and AI systems to a wide variety of problems—even those that appear similar on the surface but differ in underlying structure—helps develop stronger mathematical thinking skills.
Retrieval Benchmark
- Goal: Test whether models can recognize when two problems share the same underlying mathematical structure.
- Motivation:
- Near‑duplicate problems have appeared in real IMO exams over the years.
- Identifying mathematical equivalences across different notations, languages, and formats is challenging, even for expert human committees.
- Findings:
- Eight state‑of‑the‑art embedding models were evaluated.
- The strongest model identified the correct match only ~5 % of the time on the first try.
- Models often ranked structurally unrelated problems as more similar than truly equivalent ones.
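The retrieval benchmark boils down to a top‑1 retrieval accuracy measurement: embed every problem, rank candidates by similarity, and check whether the highest‑ranked candidate is the known structural match. The paper's exact pipeline isn't reproduced here; this is a minimal sketch of that evaluation using cosine similarity, with toy random vectors standing in for real embedding‑model outputs:

```python
import numpy as np

def top1_retrieval_accuracy(queries, corpus, true_idx):
    """For each query embedding, rank corpus embeddings by cosine
    similarity and check whether the top hit is the known match."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    sims = q @ c.T                      # (n_queries, n_corpus) cosine matrix
    best = sims.argmax(axis=1)          # index of most similar candidate
    return float((best == true_idx).mean())

# Toy embeddings: 3 query problems, 4 corpus problems.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(4, 8))
# Each query is a noisy copy of its true match (corpus indices 0, 1, 2).
true_idx = np.array([0, 1, 2])
queries = corpus[true_idx] + 0.05 * rng.normal(size=(3, 8))
print(top1_retrieval_accuracy(queries, corpus, true_idx))
```

On real Olympiad problems the signal is far weaker than in this toy setup, which is why even the strongest of the eight embedding models tested finds the correct match only ~5 % of the time.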
Retrieval‑Augmented Generation Benchmark
- Purpose: Determine if providing a model with a structurally related problem before asking it to solve a new one improves performance.
- Results:
- Performance improved when the retrieved problem was genuinely relevant.
- DeepSeek‑V3.2‑Speciale gained up to 12 percentage points with well‑matched retrieval.
- Irrelevant retrieval degraded performance in roughly 22 % of cases.
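The paper does not spell out its prompting setup, but the core idea of this benchmark, prepending a structurally related worked example before asking the model to solve a new problem, can be sketched as follows (the function name and prompt wording are illustrative, not the authors' actual templates):

```python
from typing import Optional

def build_rag_prompt(problem: str, retrieved: Optional[str]) -> str:
    """Assemble a retrieval-augmented prompt: a related worked example
    (if one was retrieved) followed by the problem to solve."""
    parts = []
    if retrieved:
        parts.append(
            "Here is a solved problem with a similar structure:\n" + retrieved
        )
    parts.append(
        "Now solve the following problem, showing full reasoning:\n" + problem
    )
    return "\n\n".join(parts)
```

The `retrieved` argument being optional mirrors the benchmark's comparison: the same problem is posed with and without the exemplar, so any score difference isolates the effect of retrieval, whether the 12‑point gain from a well‑matched example or the degradation seen when the retrieved problem is irrelevant.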
Authors & Funding
- Authors:
- Shaden Alshammari (lead author)
- Navid Safaei
- HUMAIN AI engineer Abrar Zainal
- Sultan Albarakati (KAUST Academy Director)
- MIT CSAIL colleagues:
- Master’s student Kevin Wen (SB ’25)
- Microsoft Principal Engineering Manager Mark Hamilton (SM ’22, PhD ’25)
- Professors William Freeman and Antonio Torralba
- Funding:
- Schwarzman College of Computing Fellowship
- National Science Foundation
Access
MathNet is publicly available at: .