[Paper] A Minimal Agent for Automated Theorem Proving

Published: February 27, 2026 at 01:43 PM EST
4 min read
Source: arXiv

Overview

The paper introduces a stripped‑down, “minimal” AI agent that can automatically prove mathematical theorems. By focusing only on the essential ingredients shared by modern neural theorem provers—iterative proof refinement, library search, and context handling—the authors create a lightweight baseline that can be used to benchmark and compare more complex systems. Despite its simplicity, the agent remains competitive with state‑of‑the‑art provers while consuming far fewer compute resources.

Key Contributions

  • Minimalist baseline architecture that isolates the core mechanisms of AI‑driven theorem proving.
  • Systematic benchmarking framework enabling fair, head‑to‑head comparisons across diverse models and design choices.
  • Empirical evidence that an iterative proof‑generation loop outperforms single‑shot generation in sample efficiency and cost.
  • Open‑source implementation released under a permissive license, positioned as a reference point for future research and as a ready‑to‑use prover for developers.
  • Comprehensive evaluation on qualitatively different benchmark suites (e.g., Lean, Coq, Isabelle) demonstrating competitive results across domains.

Methodology

  1. Iterative Proof Refinement – The agent generates a partial proof step, checks it against the theorem prover’s kernel, and uses the feedback to guide the next generation. This loop continues until the proof is complete or a timeout occurs.
  2. Library Search – Before each generation, the agent queries a curated library of previously proved lemmas, retrieving the most relevant ones based on semantic similarity to the current goal.
  3. Context Management – The system maintains a lightweight context object that tracks the current sub‑goal, available hypotheses, and any imported lemmas, ensuring each generation is grounded in the correct logical environment.
  4. Model Agnosticism – The architecture is model‑agnostic; the authors plug in several popular large language models (e.g., GPT‑3.5, LLaMA, CodeBERT) to demonstrate that the same pipeline works across different underlying neural back‑ends.
  5. Evaluation Protocol – Benchmarks are split into “easy,” “medium,” and “hard” categories. For each, the authors measure success rate, number of generated tokens, and total inference cost, allowing a clear view of efficiency versus effectiveness.
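Steps 1–3 above can be sketched as a single loop. This is a hypothetical illustration, not the paper's actual code: the `Context` object, the token-overlap "semantic" retrieval, and the `model`/`verifier` interfaces are all illustrative stand-ins.

```python
# Hypothetical sketch of the minimal agent's core loop: iterative proof
# refinement (step 1), library search (step 2), and a lightweight context
# object (step 3). All interfaces here are illustrative, not the paper's.
from dataclasses import dataclass, field

@dataclass
class Context:
    goal: str                                    # current sub-goal to prove
    hypotheses: list = field(default_factory=list)
    lemmas: list = field(default_factory=list)   # retrieved library lemmas

def retrieve_lemmas(goal, library, k=3):
    """Toy stand-in for semantic search: rank lemmas by token overlap."""
    goal_tokens = set(goal.lower().split())
    scored = sorted(library,
                    key=lambda lem: -len(goal_tokens & set(lem.lower().split())))
    return scored[:k]

def prove(goal, model, verifier, library, max_steps=8):
    """Generate a step, check it against the kernel, feed errors back."""
    ctx = Context(goal=goal)
    proof, feedback = [], None
    for _ in range(max_steps):
        ctx.lemmas = retrieve_lemmas(ctx.goal, library)  # library search
        step = model.generate(ctx, feedback)             # next proof step
        status, feedback = verifier.check(proof + [step])  # kernel check
        if status == "error":
            continue             # retry, guided by the verifier's message
        proof.append(step)
        if status == "complete":
            return proof         # proof accepted by the kernel
    return None                  # step budget exhausted (timeout)

# Toy stand-ins so the sketch runs end to end:
class EchoModel:
    def generate(self, ctx, feedback):
        return f"apply {ctx.lemmas[0]}" if ctx.lemmas else "trivial"

class ToyVerifier:
    def check(self, proof):
        return ("complete", None) if proof else ("error", "empty proof")
```

In this sketch the verifier's feedback is threaded back into the next generation call, which is the mechanism the paper credits for avoiding wasted single-shot generations.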

Results & Findings

| Benchmark | Success Rate (Minimal Agent) | Success Rate (SOTA) | Tokens per Proof (Avg.) |
|---|---|---|---|
| Lean (Easy) | 92% | 94% | 1.8 k |
| Coq (Medium) | 78% | 81% | 2.4 k |
| Isabelle (Hard) | 61% | 65% | 3.1 k |
  • The minimal agent consistently outperforms single‑shot baselines by 15–30 % in sample efficiency (fewer tokens needed to reach a proof).
  • Cost analysis shows up to a 40 % reduction in GPU‑hour consumption compared to leading end‑to‑end provers, thanks to the iterative feedback loop that avoids wasted generations.
  • Across all tested models, the iterative approach narrows the performance gap between smaller, cheaper models and larger, more expensive ones, suggesting that architecture matters as much as raw model size.
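The evaluation protocol behind these numbers (success rate, tokens per proof, inference cost) reduces to simple aggregation over per-theorem attempts. A minimal sketch, where the attempt records and the cost-per-token figure are illustrative, not taken from the paper:

```python
# Hypothetical sketch of the evaluation protocol: aggregate per-theorem
# outcomes into success rate, average tokens, and total inference cost.
# The sample attempts and pricing below are illustrative only.
def summarize(attempts, cost_per_1k_tokens=0.002):
    """attempts: list of dicts with 'solved' (bool) and 'tokens' (int)."""
    n = len(attempts)
    solved = sum(a["solved"] for a in attempts)
    tokens = sum(a["tokens"] for a in attempts)
    return {
        "success_rate": solved / n,            # fraction of theorems proved
        "avg_tokens": tokens / n,              # tokens per attempted proof
        "cost_usd": tokens / 1000 * cost_per_1k_tokens,
    }

attempts = [{"solved": True, "tokens": 1800},
            {"solved": True, "tokens": 2100},
            {"solved": False, "tokens": 3000}]
stats = summarize(attempts)
```

Reporting tokens and cost alongside success rate is what lets the authors separate efficiency gains (fewer tokens per proof) from raw capability gains.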

Practical Implications

  • Developer Tooling: The open‑source agent can be embedded into IDE extensions for Lean, Coq, or Isabelle, providing on‑the‑fly proof suggestions without requiring heavyweight cloud inference.
  • Cost‑Effective Research: Labs with limited compute budgets can experiment with theorem‑proving pipelines using modest GPUs, accelerating prototyping of new proof‑automation ideas.
  • Education & Training: A lightweight prover lowers the barrier for students and hobbyists to explore formal verification, enabling interactive tutorials that run locally.
  • Industry Automation: Companies building formal verification pipelines (e.g., for hardware or safety‑critical software) can adopt the iterative refinement loop to improve proof‑search efficiency, potentially reducing verification turnaround times.

Limitations & Future Work

  • Scalability to Very Large Libraries: While the current library search works well for medium‑sized corpora, performance may degrade on massive, heterogeneous lemma collections without more sophisticated indexing.
  • Model Dependency: The agent’s success still hinges on the underlying language model’s reasoning capabilities; extremely small models struggle on the hardest benchmarks.
  • Proof Explainability: The iterative loop yields a sequence of generated steps, but the system does not yet provide high‑level explanations or visualizations of the proof strategy.
  • Future Directions: The authors plan to integrate learned lemma‑ranking models, explore hybrid symbolic‑neural search, and extend the framework to support interactive proof debugging and user‑guided refinement.

Authors

  • Borja Requena Pozo
  • Austin Letson
  • Krystian Nowakowski
  • Izan Beltran Ferreiro
  • Leopoldo Sarra

Paper Information

  • arXiv ID: 2602.24273v1
  • Categories: cs.AI
  • Published: February 27, 2026
