Our First Proof Submissions
Source: OpenAI Blog
## February 20, 2026
We’re sharing our proof attempts for First Proof, a math challenge that tests whether AI can produce checkable proofs on domain‑specific problems.
View our set of proof attempts
Background
We ran an internal model on all 10 First Proof problems, a research‑level math challenge designed to test whether AI systems can produce correct, checkable proof attempts.
- Unlike short‑answer or competition‑style math, these problems require building end‑to‑end arguments in specialized domains, and correctness is hard to establish without expert review.
- The authors of the First Proof problems are leading experts in their respective fields; several of the problems were open for years before the authors found solutions.
- An academic department with substantial overlap with the subject areas could conceivably solve many of the problems in one week.
We shared our proof attempts on Saturday, February 14, 2026 at 12:00 AM PT. Based on feedback from experts:
- At least five of the model’s proof attempts (problems 4, 5, 6, 9, and 10) have a high chance of being correct.
- Several others remain under review.
- We initially believed our attempt for problem 2 was likely correct; after the official First Proof commentary and further community analysis, we now think it is incorrect.
Our full set of proof attempts is available as a preprint, which includes all ten proof attempts plus a newly added appendix with prompt patterns and examples that simulate our manual interactions with the models during the process.
Why Frontier Research Matters
Benchmarks are useful, but they can miss some of the hardest parts of research:
- Sustaining long chains of reasoning
- Choosing the right abstractions
- Handling ambiguity in problem statements
- Producing arguments that survive expert scrutiny
Frontier challenges like First Proof help us stress‑test those capabilities in settings where correctness is non‑trivial to verify and failure modes are informative.
“We’re currently training a new model for which a primary focus is increasing the level of rigor in its thinking, with the goal that the model can think continuously for many hours and remain highly confident in its conclusions. When the First Proof problems were announced, it seemed like the perfect testbed, so over the weekend I tried it out. Already it was able to solve two of the problems (#9 and #10). As it trained, it became increasingly capable, eventually solving—in our estimation—at least three more. We were particularly pleased when it solved #6 and then, two days later, #4, as those problems were from fields familiar to many of us. It’s pretty incredible to watch a model get tangibly smarter day by day.”
— James R. Lee, OpenAI Researcher, Reasoning
We ran the model with limited human supervision. When prompting versions of the model during training, we sometimes suggested retrying strategies that appeared fruitful in earlier attempts. For some attempts, we asked the model to expand or clarify parts of a proof after receiving expert feedback, making the reasoning easier to verify. We also facilitated a back‑and‑forth between this model and ChatGPT for verification, formatting, and style. For each problem we present the best of a few attempts, selected by human judgment.
This was a fast sprint, and our process was not as clean as we would like in a properly controlled evaluation. We look forward to discussions with the First Proof organizers about a more rigorous experiment and evaluation framework for future iterations.
Related Milestones
- July 2025 – Reached gold‑medal‑level performance on the International Mathematical Olympiad with a general‑purpose reasoning model (35/42 points).
- November 2025 – Published “Early experiments in accelerating science with GPT‑5,” a set of case studies where GPT‑5 helped researchers make concrete progress across math, physics, biology, and other fields, along with observed limitations.
- Most recent – Reported a physics collaboration where GPT‑5.2 proposed a candidate expression for a gluon‑amplitude formula that was then formally proved by an internal model and verified by the authors.
We look forward to deeper engagement with the community on how to evaluate research‑grade reasoning, including expert feedback on these attempts, and we’re excited to make these new capabilities available in future public models.