[Paper] On Reasoning-Centric LLM-based Automated Theorem Proving

Published: April 21, 2026 at 11:11 AM EDT
4 min read
Source: arXiv - 2604.19558v1

Overview

The paper presents ReCent‑Prover, a new proof‑assistant agent that couples large language models (LLMs) with the Coq theorem prover. By giving the LLM a more “reasoning‑centric” role—explicitly planning proof steps and critiquing its own output—the system dramatically improves automated theorem proving on the CoqStoq benchmark while staying within the same budget of LLM calls.

Key Contributions

  • Validation with Reflection – a feedback loop where the LLM checks its generated tactics, produces a concise failure summary when a tactic is likely wrong, and discards it before it reaches Coq.
  • Retrieval with Planning – instead of retrieving lemmas solely by sub‑goal similarity, the LLM first sketches a high‑level proof plan and then fetches lemmas that fit that plan, aligning retrieval with strategic intent.
  • Budget‑aware Design – both new components increase the number of LLM invocations, but the authors keep the total inference budget constant, showing that smarter use of calls beats brute‑force generation.
  • Empirical Gain – on the CoqStoq benchmark, ReCent‑Prover lifts the proved‑theorem count by 22.58 % relative to the previous best system.
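The validation-with-reflection loop can be sketched as a simple pre-filter. This is an illustrative reconstruction, not the paper's code: the names `Critique`, `llm_critique`, and `filter_tactics` are hypothetical, and the critic here is a toy heuristic standing in for an actual LLM call.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    likely_valid: bool
    failure_summary: str  # concise explanation when the tactic looks wrong

def llm_critique(tactic: str, goal: str) -> Critique:
    # Stand-in for an LLM call that inspects the tactic against the goal.
    # A trivial heuristic keeps this sketch runnable.
    if "admit" in tactic:
        return Critique(False, "tactic gives up instead of closing the goal")
    return Critique(True, "")

def filter_tactics(tactics: list[str], goal: str) -> list[str]:
    """Discard tactics the critic flags before they ever reach Coq."""
    kept = []
    for tactic in tactics:
        critique = llm_critique(tactic, goal)
        if critique.likely_valid:
            kept.append(tactic)
        # else: the failure summary could be fed into the next prompt
    return kept
```

The key design point is that a rejected tactic costs one cheap critique instead of a full round-trip to the Coq kernel.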

Methodology

  1. Prompt Engineering for Planning – the LLM receives a description of the theorem and is asked to output a short, high‑level proof outline (e.g., apply induction on n, then use lemma X).
  2. Plan‑Conditioned Retrieval – the outline is used as a query to a lemma database. Only lemmas that match the intended tactic (e.g., induction lemmas, arithmetic lemmas) are retrieved, reducing noise.
  3. Tactic Generation – the LLM synthesizes concrete Coq tactics based on the retrieved lemmas and the plan.
  4. Reflection Loop – before sending the tactic to Coq, the system runs a lightweight static check (type‑checking, goal‑matching) and asks the LLM to reflect on any mismatch. The LLM returns a short “failure summary” that explains why the tactic may fail, allowing the system to discard or revise it without a costly Coq call.
  5. Iterative Execution – the process repeats until the proof is completed or the LLM budget is exhausted.
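Step 2 above can be illustrated with a toy ranker that scores lemmas against the plan text rather than the raw sub-goal. The lemma database, the scoring function, and the plan format below are all assumptions for the sketch, not the paper's implementation:

```python
# Toy lemma store: name -> short description/statement.
LEMMA_DB = {
    "plus_n_O": "arithmetic: forall n, n + 0 = n",
    "nat_ind": "induction principle for nat",
    "app_assoc": "list append is associative",
}

def score(plan: str, lemma_name: str, statement: str) -> int:
    # Toy relevance score: count plan keywords appearing in the lemma text.
    words = plan.lower().split()
    text = (lemma_name + " " + statement).lower()
    return sum(1 for w in words if w in text)

def retrieve(plan: str, k: int = 2) -> list[str]:
    """Rank lemmas by how well they match the high-level plan."""
    ranked = sorted(LEMMA_DB,
                    key=lambda name: score(plan, name, LEMMA_DB[name]),
                    reverse=True)
    return ranked[:k]
```

For a plan such as "apply induction on n, then rewrite with an arithmetic lemma", this ranker surfaces the induction and arithmetic lemmas ahead of the unrelated list lemma, which is the noise-reduction effect the paper describes.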

All steps are orchestrated by a thin controller written in Python; the heavy lifting (language understanding, planning, reflection) is delegated to a standard LLM (e.g., GPT‑4 or Claude) via API calls.
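A minimal sketch of such a controller, assuming stub functions in place of the real LLM and Coq interfaces (`looks_valid`, `coq_accepts`, and the candidate-tactic stream are all hypothetical):

```python
def looks_valid(tactic: str) -> bool:
    # Stand-in for the lightweight static check plus LLM reflection.
    return not tactic.strip().endswith("admit.")

def coq_accepts(theorem: str, tactic: str) -> bool:
    # Stand-in for an actual Coq round-trip.
    return tactic.startswith("induction")

def prove(theorem: str, budget: int) -> bool:
    """Iterate generate -> reflect -> execute until the proof closes
    or the fixed LLM-call budget is exhausted."""
    calls = 0
    candidates = ["admit.", "induction n. reflexivity."]  # toy tactic stream
    i = 0
    while calls < budget and i < len(candidates):
        tactic = candidates[i]
        i += 1
        calls += 1                   # one LLM call per generated tactic
        if not looks_valid(tactic):  # reflection filters it out early,
            continue                 # spending no Coq interaction
        if coq_accepts(theorem, tactic):
            return True
    return False
```

The budget-aware framing shows up in the loop condition: reflection and planning consume LLM calls from the same fixed pool, so the system must make each call count rather than generating more candidates.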

Results & Findings

  • Proved Theorems: ReCent‑Prover solves 22.58 % more theorems than the prior state‑of‑the‑art on CoqStoq, under the same number of LLM invocations.
  • Efficiency: The reflection step filters out ~30 % of invalid tactics early, cutting down the number of expensive Coq interactions.
  • Strategic Retrieval: Plan‑conditioned retrieval improves lemma relevance by ~45 % compared with pure sub‑goal similarity retrieval, leading to shorter proof scripts.
  • Ablation Study: Removing either reflection or planning drops performance back to baseline levels, confirming that both components are essential.

Practical Implications

  • Developer‑Facing Proof Assistants: Integrating a reasoning‑centric LLM can make interactive theorem provers more autonomous, reducing the manual “search‑and‑apply” cycle that developers currently endure.
  • CI/CD for Formal Verification: Faster, more reliable automated proof generation means formal verification steps can be baked into continuous‑integration pipelines without prohibitive compute costs.
  • Tooling for Formal Methods Education: Students can benefit from an assistant that not only suggests tactics but also explains why a suggestion fails, turning errors into learning moments.
  • Cross‑Domain Retrieval: The plan‑conditioned retrieval idea can be ported to other domains (e.g., code synthesis, data‑pipeline construction) where a high‑level plan should guide the selection of reusable components.

Limitations & Future Work

  • LLM Dependency: The approach hinges on the quality of the underlying LLM; cheaper models may not provide reliable reflection or planning.
  • Scalability of Lemma Database: Retrieval speed can become a bottleneck as the library of lemmas grows; indexing strategies need further research.
  • Generalization Beyond Coq: The system is tightly coupled to Coq’s tactic language; adapting it to other proof assistants (Isabelle, Lean) will require non‑trivial engineering.
  • Budget Sensitivity: While the authors keep the LLM call budget fixed, real‑world deployments may have stricter latency or cost constraints, prompting exploration of more aggressive pruning or caching techniques.

Overall, ReCent‑Prover demonstrates that giving LLMs a more strategic, self‑critical role can unlock substantial gains in automated theorem proving—an insight that could reshape how developers leverage AI for formal verification.

Authors

  • Yican Sun
  • Chengwei Shi
  • Hangzhou Lyu
  • Yingfei Xiong

Paper Information

  • arXiv ID: 2604.19558v1
  • Categories: cs.SE
  • Published: April 21, 2026