[Paper] Aletheia tackles FirstProof autonomously

Published: February 24, 2026, 01:56 PM EST
5 min read
Source: arXiv (2602.21201v1)

Overview

The paper presents Aletheia, an autonomous mathematics‑research agent built on the Gemini 3 Deep Think language model. In the inaugural FirstProof competition, Aletheia tackled ten open‑ended proof problems and, within the contest window, produced correct solutions for six of them (problems 2, 5, 7, 8, 9, 10) as judged by a panel of experts. The work showcases how large‑scale generative models can be harnessed for high‑level mathematical reasoning without human‑in‑the‑loop guidance.

Key Contributions

  • First autonomous entry to the FirstProof challenge, demonstrating end‑to‑end problem selection, reasoning, and proof generation.
  • Integration of Gemini 3 Deep Think with a specialized prompting pipeline that elicits step‑by‑step mathematical reasoning.
  • Empirical benchmark: 6/10 problems solved to expert‑majority satisfaction, with detailed analysis of each solution’s strengths and weaknesses.
  • Open‑source transparency: full prompt‑output logs released on GitHub, enabling reproducibility and community inspection.
  • Evaluation framework for assessing AI‑generated proofs via majority expert voting, highlighting the nuances of consensus in mathematical validation.
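The majority-vote evaluation framework described above can be sketched in a few lines. The verdict labels follow the paper's three-way voting scheme, but the tie-handling rule and function name here are assumptions, since the exact rubric is not reproduced in this summary.

```python
from collections import Counter

def majority_verdict(votes: list[str]) -> str:
    """Return the verdict held by a strict majority of panelists.

    Votes are the labels used by the FirstProof panel: "correct",
    "partially correct", or "incorrect". When no label wins a strict
    majority, we return "no consensus" (an assumed fallback).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label if n > len(votes) / 2 else "no consensus"

# Problem 8's verdict pattern: majority correct with one dissent.
print(majority_verdict(["correct", "correct", "incorrect"]))
```

A structure like this makes the "consensus ambiguity" limitation concrete: a single dissenting vote can flip a borderline proof between categories.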

Methodology

  1. Problem Ingestion – The raw FirstProof statements (including definitions and constraints) are fed to the agent via a standardized JSON schema.
  2. Prompt Engineering – A multi‑stage prompt template guides Gemini 3 to (a) restate the problem, (b) outline a high‑level proof strategy, (c) expand each step with formal reasoning, and (d) produce a final proof write‑up.
  3. Self‑Verification Loop – After generating a draft, Aletheia runs a lightweight symbolic checker (e.g., SymPy) on intermediate lemmas and asks the model to revise any flagged inconsistencies.
  4. Time‑Bound Execution – All reasoning runs are capped at the competition’s wall‑clock limit, ensuring the system operates autonomously without external intervention.
  5. Human Evaluation – A panel of mathematicians reviews each proof, voting “correct”, “partially correct”, or “incorrect”. Majority votes determine the final success metric.
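Step 3's lightweight symbolic check can be illustrated with SymPy. The lemma below is a stand-in example (the Gauss summation identity), not one drawn from the competition; the paper does not publish which intermediate lemmas were checked.

```python
import sympy as sp

# Stand-in intermediate lemma: sum_{k=1}^{n} k = n(n+1)/2.
n = sp.symbols("n", positive=True, integer=True)
k = sp.symbols("k", positive=True, integer=True)

lhs = sp.summation(k, (k, 1, n))   # closed form computed symbolically
rhs = n * (n + 1) / 2              # the claim made in the draft proof

# A zero residue certifies the identity; a nonzero residue would be
# fed back to the model as a flagged inconsistency to revise.
residue = sp.simplify(lhs - rhs)
print("lemma holds" if residue == 0 else "lemma flagged for revision")
```

Checks of this kind can only certify algebraically expressible steps, which is exactly the verification bottleneck the authors note for higher-order reasoning.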

The pipeline is deliberately kept modular so developers can swap out the underlying LLM, the verification backend, or the prompting style without redesigning the whole system.
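That modularity might look like the following protocol-based sketch. All interface and class names here are hypothetical, since the paper does not publish its internal APIs; the stub components stand in for Gemini 3 Deep Think and the SymPy checker.

```python
from typing import Protocol

class LanguageModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class Verifier(Protocol):
    def check(self, draft: str) -> list[str]: ...  # flagged issues

class EchoModel:
    """Stub LLM; a real backend would call Gemini 3 Deep Think."""
    def generate(self, prompt: str) -> str:
        return f"Proof sketch for: {prompt}"

class NoOpVerifier:
    """Stub checker; a real backend would run symbolic verification."""
    def check(self, draft: str) -> list[str]:
        return []

def run_pipeline(model: LanguageModel, verifier: Verifier,
                 problem: str, max_revisions: int = 3) -> str:
    """Generate a draft, then revise while the verifier flags issues."""
    draft = model.generate(problem)
    for _ in range(max_revisions):
        issues = verifier.check(draft)
        if not issues:
            break
        draft = model.generate(f"Revise to address {issues}: {draft}")
    return draft

print(run_pipeline(EchoModel(), NoOpVerifier(), "Problem 2"))
```

Because `run_pipeline` depends only on the two protocols, swapping the LLM, the verification backend, or the prompting style means replacing one component rather than redesigning the loop.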

Results & Findings

| Problem | Expert Verdict | Notes |
| --- | --- | --- |
| 1 | Incorrect | Model missed a subtle counterexample. |
| 2 | Correct | Clean constructive proof, fully verified. |
| 3 | Incorrect | Stalled on a non-trivial combinatorial identity. |
| 4 | Incorrect | Proof diverged into an unrelated domain. |
| 5 | Correct | Leveraged a known inequality elegantly. |
| 6 | Incorrect | Failed to close a gap in an induction step. |
| 7 | Correct | Demonstrated novel use of generating functions. |
| 8 | Majority correct (1 dissent) | Minor notation ambiguity; still acceptable. |
| 9 | Correct | Combined algebraic manipulation with geometric insight. |
| 10 | Correct | Produced a concise proof that matched textbook style. |

Overall, Aletheia achieved a 60% majority‑correct rate. The successes were concentrated in problems that admitted a constructive or algorithmic proof strategy, while the failures tended to involve deep combinatorial insight or obscure lemmas absent from the model's training data.

Practical Implications

  • Developer Tooling – Aletheia’s prompting stack can be repurposed for automated theorem proving assistants, code‑generation tools that need formal correctness guarantees, or symbolic reasoning modules in scientific software.
  • Accelerated Research – Researchers can use a similar agent to explore proof sketches, generate conjecture‑driven lemmas, or perform “dry‑run” verification before investing human time.
  • Education – The step‑by‑step reasoning output serves as a tutoring aid that shows students how a large language model structures a proof, and it can be integrated into interactive learning platforms.
  • Productivity Gains – In domains like cryptography, control theory, or formal verification, an autonomous proof generator can pre‑screen design constraints, flag impossible specifications, or suggest constructive alternatives.
  • Open‑source Ecosystem – By publishing the raw prompts and outputs, the authors invite the community to benchmark other LLMs, improve verification loops, or build domain‑specific extensions (e.g., for topology or number theory).

Limitations & Future Work

  • Coverage Gaps – The agent struggled with problems requiring deep combinatorial intuition or obscure lemmas not well represented in its training corpus.
  • Verification Bottleneck – The lightweight symbolic checker cannot certify all higher‑order reasoning steps; integrating a full proof assistant (Coq, Lean) remains an open challenge.
  • Consensus Ambiguity – The reliance on majority expert voting means borderline proofs may be classified inconsistently; a more granular scoring rubric could provide clearer feedback.
  • Scalability – Current experiments were limited to a fixed time budget; scaling to larger problem sets or longer proofs will demand more efficient prompting and possibly model distillation.
  • Future Directions – The authors plan to (a) augment the model with retrieval of specialized mathematical literature, (b) tighten the self‑verification loop with theorem‑prover back‑ends, and (c) explore few‑shot adaptation to new mathematical sub‑domains.

Bottom line: Aletheia demonstrates that today’s LLMs, when paired with thoughtful prompting and modest symbolic checks, can autonomously produce mathematically rigorous arguments at a level that already benefits developers, researchers, and educators. Continued refinements promise even broader applicability across the tech industry’s most logic‑intensive challenges.

Authors

  • Tony Feng
  • Junehyuk Jung
  • Sang-hyun Kim
  • Carlo Pagano
  • Sergei Gukov
  • Chiang-Chiang Tsai
  • David Woodruff
  • Adel Javanmard
  • Aryan Mokhtari
  • Dawsen Hwang
  • Yuri Chervonyi
  • Jonathan N. Lee
  • Garrett Bingham
  • Trieu H. Trinh
  • Vahab Mirrokni
  • Quoc V. Le
  • Thang Luong

Paper Information

  • arXiv ID: 2602.21201v1
  • Categories: cs.AI, cs.CL, cs.LG
  • Published: February 24, 2026