[Paper] A Minimal Agent for Automated Theorem Proving

Published: February 27, 2026 at 01:43 PM EST
4 min read
Source: arXiv

Overview

The paper introduces a stripped‑down, “minimal” AI agent that can automatically prove mathematical theorems. By focusing only on the essential ingredients shared by modern neural theorem provers—iterative proof refinement, library search, and context handling—the authors create a lightweight baseline that can be used to benchmark and compare more complex systems. Despite its simplicity, the agent remains competitive with state‑of‑the‑art provers while consuming far fewer compute resources.

Key Contributions

  • Minimalist baseline architecture that isolates the core mechanisms of AI‑driven theorem proving.
  • Systematic benchmarking framework enabling fair, head‑to‑head comparisons across diverse models and design choices.
  • Empirical evidence that an iterative proof‑generation loop outperforms single‑shot generation in sample efficiency and cost.
  • Open‑source implementation released under a permissive license, positioned as a reference point for future research and as a ready‑to‑use prover for developers.
  • Comprehensive evaluation on qualitatively different benchmark suites (e.g., Lean, Coq, Isabelle) demonstrating competitive results across domains.

Methodology

  1. Iterative Proof Refinement – The agent generates a partial proof step, checks it against the theorem prover’s kernel, and uses the feedback to guide the next generation. This loop continues until the proof is complete or a timeout occurs.
  2. Library Search – Before each generation, the agent queries a curated library of previously proved lemmas, retrieving the most relevant ones based on semantic similarity to the current goal.
  3. Context Management – The system maintains a lightweight context object that tracks the current sub‑goal, available hypotheses, and any imported lemmas, ensuring each generation is grounded in the correct logical environment.
  4. Model Agnosticism – The architecture is model‑agnostic; the authors plug in several popular large language models (e.g., GPT‑3.5, LLaMA, CodeBERT) to demonstrate that the same pipeline works across different underlying neural back‑ends.
  5. Evaluation Protocol – Benchmarks are split into “easy,” “medium,” and “hard” categories. For each, the authors measure success rate, number of generated tokens, and total inference cost, allowing a clear view of efficiency versus effectiveness.
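Steps 1–3 above can be sketched as a single loop. This is a hypothetical illustration, not the paper's actual code: the `Context` object, the token-overlap "semantic" retrieval, and the `model`/`verifier` interfaces are all illustrative stand-ins.

```python
# Hypothetical sketch of the minimal agent's core loop: iterative proof
# refinement (step 1), library search (step 2), and a lightweight context
# object (step 3). All interfaces here are illustrative, not the paper's.
from dataclasses import dataclass, field

@dataclass
class Context:
    goal: str                                    # current sub-goal to prove
    hypotheses: list = field(default_factory=list)
    lemmas: list = field(default_factory=list)   # retrieved library lemmas

def retrieve_lemmas(goal, library, k=3):
    """Toy stand-in for semantic search: rank lemmas by token overlap."""
    goal_tokens = set(goal.lower().split())
    scored = sorted(library,
                    key=lambda lem: -len(goal_tokens & set(lem.lower().split())))
    return scored[:k]

def prove(goal, model, verifier, library, max_steps=8):
    """Generate a step, check it against the kernel, feed errors back."""
    ctx = Context(goal=goal)
    proof, feedback = [], None
    for _ in range(max_steps):
        ctx.lemmas = retrieve_lemmas(ctx.goal, library)  # library search
        step = model.generate(ctx, feedback)             # next proof step
        status, feedback = verifier.check(proof + [step])  # kernel check
        if status == "error":
            continue             # retry, guided by the verifier's message
        proof.append(step)
        if status == "complete":
            return proof         # proof accepted by the kernel
    return None                  # step budget exhausted (timeout)

# Toy stand-ins so the sketch runs end to end:
class EchoModel:
    def generate(self, ctx, feedback):
        return f"apply {ctx.lemmas[0]}" if ctx.lemmas else "trivial"

class ToyVerifier:
    def check(self, proof):
        return ("complete", None) if proof else ("error", "empty proof")
```

In this sketch the verifier's feedback is threaded back into the next generation call, which is the mechanism the paper credits for avoiding wasted single-shot generations.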

Results & Findings

| Benchmark | Success Rate (Minimal Agent) | Success Rate (SOTA) | Tokens per Proof (Avg.) |
|---|---|---|---|
| Lean (Easy) | 92% | 94% | 1.8 k |
| Coq (Medium) | 78% | 81% | 2.4 k |
| Isabelle (Hard) | 61% | 65% | 3.1 k |
  • The minimal agent consistently outperforms single‑shot baselines by 15–30 % in sample efficiency (fewer tokens needed to reach a proof).
  • Cost analysis shows up to a 40 % reduction in GPU‑hour consumption compared to leading end‑to‑end provers, thanks to the iterative feedback loop that avoids wasted generations.
  • Across all tested models, the iterative approach narrows the performance gap between smaller, cheaper models and larger, more expensive ones, suggesting that architecture matters as much as raw model size.
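The evaluation protocol behind these numbers (success rate, tokens per proof, inference cost) reduces to simple aggregation over per-theorem attempts. A minimal sketch, where the attempt records and the cost-per-token figure are illustrative, not taken from the paper:

```python
# Hypothetical sketch of the evaluation protocol: aggregate per-theorem
# outcomes into success rate, average tokens, and total inference cost.
# The sample attempts and pricing below are illustrative only.
def summarize(attempts, cost_per_1k_tokens=0.002):
    """attempts: list of dicts with 'solved' (bool) and 'tokens' (int)."""
    n = len(attempts)
    solved = sum(a["solved"] for a in attempts)
    tokens = sum(a["tokens"] for a in attempts)
    return {
        "success_rate": solved / n,            # fraction of theorems proved
        "avg_tokens": tokens / n,              # tokens per attempted proof
        "cost_usd": tokens / 1000 * cost_per_1k_tokens,
    }

attempts = [{"solved": True, "tokens": 1800},
            {"solved": True, "tokens": 2100},
            {"solved": False, "tokens": 3000}]
stats = summarize(attempts)
```

Reporting tokens and cost alongside success rate is what lets the authors separate efficiency gains (fewer tokens per proof) from raw capability gains.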

Practical Implications

  • Developer Tooling: The open‑source agent can be embedded into IDE extensions for Lean, Coq, or Isabelle, providing on‑the‑fly proof suggestions without requiring heavyweight cloud inference.
  • Cost‑Effective Research: Labs with limited compute budgets can experiment with theorem‑proving pipelines using modest GPUs, accelerating prototyping of new proof‑automation ideas.
  • Education & Training: A lightweight prover lowers the barrier for students and hobbyists to explore formal verification, enabling interactive tutorials that run locally.
  • Industry Automation: Companies building formal verification pipelines (e.g., for hardware or safety‑critical software) can adopt the iterative refinement loop to improve proof‑search efficiency, potentially reducing verification turnaround times.

Limitations & Future Work

  • Scalability to Very Large Libraries: While the current library search works well for medium‑sized corpora, performance may degrade on massive, heterogeneous lemma collections without more sophisticated indexing.
  • Model Dependency: The agent’s success still hinges on the underlying language model’s reasoning capabilities; extremely small models struggle on the hardest benchmarks.
  • Proof Explainability: The iterative loop yields a sequence of generated steps, but the system does not yet provide high‑level explanations or visualizations of the proof strategy.
  • Future Directions: The authors plan to integrate learned lemma‑ranking models, explore hybrid symbolic‑neural search, and extend the framework to support interactive proof debugging and user‑guided refinement.

Authors

  • Borja Requena Pozo
  • Austin Letson
  • Krystian Nowakowski
  • Izan Beltran Ferreiro
  • Leopoldo Sarra

Paper Information

  • arXiv ID: 2602.24273v1
  • Categories: cs.AI
  • Published: February 27, 2026
