[Paper] Aletheia tackles FirstProof autonomously

Published: February 24, 2026, 01:56 PM EST
5 min read
Source: arXiv (2602.21201v1)

Overview

The paper presents Aletheia, an autonomous mathematics‑research agent built on the Gemini 3 Deep Think language model. In the inaugural FirstProof competition, Aletheia tackled ten open‑ended proof problems and, within the contest window, produced correct solutions for six of them (problems 2, 5, 7, 8, 9, 10) as judged by a panel of experts. The work showcases how large‑scale generative models can be harnessed for high‑level mathematical reasoning without human‑in‑the‑loop guidance.

Key Contributions

  • First autonomous entry to the FirstProof challenge, demonstrating end‑to‑end problem selection, reasoning, and proof generation.
  • Integration of Gemini 3 Deep Think with a specialized prompting pipeline that elicits step‑by‑step mathematical reasoning.
  • Empirical benchmark: 6/10 problems solved to expert‑majority satisfaction, with detailed analysis of each solution’s strengths and weaknesses.
  • Open‑source transparency: full prompt‑output logs released on GitHub, enabling reproducibility and community inspection.
  • Evaluation framework for assessing AI‑generated proofs via majority expert voting, highlighting the nuances of consensus in mathematical validation.
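The majority-vote evaluation framework described above can be sketched in a few lines. The verdict labels follow the paper's three-way voting scheme, but the tie-handling rule and function name here are assumptions, since the exact rubric is not reproduced in this summary.

```python
from collections import Counter

def majority_verdict(votes: list[str]) -> str:
    """Return the verdict held by a strict majority of panelists.

    Votes are the labels used by the FirstProof panel: "correct",
    "partially correct", or "incorrect". When no label wins a strict
    majority, we return "no consensus" (an assumed fallback).
    """
    counts = Counter(votes)
    label, n = counts.most_common(1)[0]
    return label if n > len(votes) / 2 else "no consensus"

# Problem 8's verdict pattern: majority correct with one dissent.
print(majority_verdict(["correct", "correct", "incorrect"]))
```

A structure like this makes the "consensus ambiguity" limitation concrete: a single dissenting vote can flip a borderline proof between categories.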

Methodology

  1. Problem Ingestion – The raw FirstProof statements (including definitions and constraints) are fed to the agent via a standardized JSON schema.
  2. Prompt Engineering – A multi‑stage prompt template guides Gemini 3 to (a) restate the problem, (b) outline a high‑level proof strategy, (c) expand each step with formal reasoning, and (d) produce a final proof write‑up.
  3. Self‑Verification Loop – After generating a draft, Aletheia runs a lightweight symbolic checker (e.g., SymPy) on intermediate lemmas and asks the model to revise any flagged inconsistencies.
  4. Time‑Bound Execution – All reasoning runs are capped at the competition’s wall‑clock limit, ensuring the system operates autonomously without external intervention.
  5. Human Evaluation – A panel of mathematicians reviews each proof, voting “correct”, “partially correct”, or “incorrect”. Majority votes determine the final success metric.
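Step 3's lightweight symbolic check can be illustrated with SymPy. The lemma below is a stand-in example (the Gauss summation identity), not one drawn from the competition; the paper does not publish which intermediate lemmas were checked.

```python
import sympy as sp

# Stand-in intermediate lemma: sum_{k=1}^{n} k = n(n+1)/2.
n = sp.symbols("n", positive=True, integer=True)
k = sp.symbols("k", positive=True, integer=True)

lhs = sp.summation(k, (k, 1, n))   # closed form computed symbolically
rhs = n * (n + 1) / 2              # the claim made in the draft proof

# A zero residue certifies the identity; a nonzero residue would be
# fed back to the model as a flagged inconsistency to revise.
residue = sp.simplify(lhs - rhs)
print("lemma holds" if residue == 0 else "lemma flagged for revision")
```

Checks of this kind can only certify algebraically expressible steps, which is exactly the verification bottleneck the authors note for higher-order reasoning.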

The pipeline is deliberately kept modular so developers can swap out the underlying LLM, the verification backend, or the prompting style without redesigning the whole system.
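That modularity might look like the following protocol-based sketch. All interface and class names here are hypothetical, since the paper does not publish its internal APIs; the stub components stand in for Gemini 3 Deep Think and the SymPy checker.

```python
from typing import Protocol

class LanguageModel(Protocol):
    def generate(self, prompt: str) -> str: ...

class Verifier(Protocol):
    def check(self, draft: str) -> list[str]: ...  # flagged issues

class EchoModel:
    """Stub LLM; a real backend would call Gemini 3 Deep Think."""
    def generate(self, prompt: str) -> str:
        return f"Proof sketch for: {prompt}"

class NoOpVerifier:
    """Stub checker; a real backend would run symbolic verification."""
    def check(self, draft: str) -> list[str]:
        return []

def run_pipeline(model: LanguageModel, verifier: Verifier,
                 problem: str, max_revisions: int = 3) -> str:
    """Generate a draft, then revise while the verifier flags issues."""
    draft = model.generate(problem)
    for _ in range(max_revisions):
        issues = verifier.check(draft)
        if not issues:
            break
        draft = model.generate(f"Revise to address {issues}: {draft}")
    return draft

print(run_pipeline(EchoModel(), NoOpVerifier(), "Problem 2"))
```

Because `run_pipeline` depends only on the two protocols, swapping the LLM, the verification backend, or the prompting style means replacing one component rather than redesigning the loop.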

Results & Findings

| Problem | Expert Verdict | Notes |
| --- | --- | --- |
| 1 | Incorrect | Model missed a subtle counterexample. |
| 2 | Correct | Clean constructive proof, fully verified. |
| 3 | Incorrect | Stalled on a non-trivial combinatorial identity. |
| 4 | Incorrect | Proof diverged into an unrelated domain. |
| 5 | Correct | Leveraged a known inequality elegantly. |
| 6 | Incorrect | Failed to close a gap in an induction step. |
| 7 | Correct | Demonstrated novel use of generating functions. |
| 8 | Majority correct (1 dissent) | Minor notation ambiguity; still acceptable. |
| 9 | Correct | Combined algebraic manipulation with geometric insight. |
| 10 | Correct | Produced a concise proof that matched textbook style. |

Overall, Aletheia achieved a 60% majority‑correct rate. The successes were concentrated in problems that admitted a constructive or algorithmic proof strategy, while the failures tended to involve deep combinatorial insight or obscure lemmas absent from the model's training data.

Practical Implications

  • Developer Tooling – Aletheia’s prompting stack can be repurposed for automated theorem proving assistants, code‑generation tools that need formal correctness guarantees, or symbolic reasoning modules in scientific software.
  • Accelerated Research – Researchers can use a similar agent to explore proof sketches, generate conjecture‑driven lemmas, or perform “dry‑run” verification before investing human time.
  • Education – The step‑by‑step reasoning output serves as a tutoring aid that shows students how a large language model structures a proof, and it can be integrated into interactive learning platforms.
  • Productivity Gains – In domains like cryptography, control theory, or formal verification, an autonomous proof generator can pre‑screen design constraints, flag impossible specifications, or suggest constructive alternatives.
  • Open‑source Ecosystem – By publishing the raw prompts and outputs, the authors invite the community to benchmark other LLMs, improve verification loops, or build domain‑specific extensions (e.g., for topology or number theory).

Limitations & Future Work

  • Coverage Gaps – The agent struggled with problems requiring deep combinatorial intuition or obscure lemmas not well represented in its training corpus.
  • Verification Bottleneck – The lightweight symbolic checker cannot certify all higher‑order reasoning steps; integrating a full proof assistant (Coq, Lean) remains an open challenge.
  • Consensus Ambiguity – The reliance on majority expert voting means borderline proofs may be classified inconsistently; a more granular scoring rubric could provide clearer feedback.
  • Scalability – Current experiments were limited to a fixed time budget; scaling to larger problem sets or longer proofs will demand more efficient prompting and possibly model distillation.
  • Future Directions – The authors plan to (a) augment the model with retrieval of specialized mathematical literature, (b) tighten the self‑verification loop with theorem‑prover back‑ends, and (c) explore few‑shot adaptation to new mathematical sub‑domains.

Bottom line: Aletheia demonstrates that today’s LLMs, when paired with thoughtful prompting and modest symbolic checks, can autonomously produce mathematically rigorous arguments at a level that already benefits developers, researchers, and educators. Continued refinements promise even broader applicability across the tech industry’s most logic‑intensive challenges.

Authors

  • Tony Feng
  • Junehyuk Jung
  • Sang-hyun Kim
  • Carlo Pagano
  • Sergei Gukov
  • Chiang-Chiang Tsai
  • David Woodruff
  • Adel Javanmard
  • Aryan Mokhtari
  • Dawsen Hwang
  • Yuri Chervonyi
  • Jonathan N. Lee
  • Garrett Bingham
  • Trieu H. Trinh
  • Vahab Mirrokni
  • Quoc V. Le
  • Thang Luong

Paper Information

  • arXiv ID: 2602.21201v1
  • Categories: cs.AI, cs.CL, cs.LG
  • Published: February 24, 2026