[Paper] RocqSmith: Can Automatic Optimization Forge Better Proof Agents?
Source: arXiv - 2602.05762v1
Overview
The paper RocqSmith: Can Automatic Optimization Forge Better Proof Agents? investigates whether modern AI‑driven optimizer frameworks—originally built for reinforcement‑learning agents or large language models—can be repurposed to improve automated theorem provers that work inside the Coq proof assistant (recently renamed Rocq, hence the framework's name). By treating a Coq proof‑generation bot as an “agent” and letting off‑the‑shelf optimizers tune its prompts, knowledge base, and control flow, the authors ask: can a fully automatic pipeline replace painstaking hand‑crafting?
Key Contributions
- Benchmarking optimizer families – The authors adapt several generic optimizer suites (e.g., Bayesian optimization, evolutionary strategies, few‑shot bootstrapping) to the task of tuning a Coq proof‑generation agent.
- Fine‑grained “agentic” parameter study – They identify three levers that matter for proof agents: prompt design, contextual knowledge injection, and control‑strategy selection, and expose them to the optimizers.
- Empirical evaluation on real Coq libraries – Experiments run on a diverse set of Coq developments (including the standard library, MathComp, and a subset of the Verified Software Toolchain) provide a realistic performance picture.
- Finding that simple few‑shot bootstrapping outperforms more complex optimizers – Despite the sophistication of Bayesian and evolutionary methods, a lightweight few‑shot prompting approach consistently yields the biggest gains.
- Open‑source tooling – The paper releases the RocqSmith framework, a plug‑and‑play wrapper that lets researchers attach any optimizer to a Coq proof agent with minimal code changes.
Methodology
- Base Proof Agent – The authors start from a baseline Coq proof‑generation system built on top of a large language model (LLM) that receives a theorem statement and returns a sequence of Coq tactics.
- Parameter Space Definition
- Prompt Templates: variations in wording, example ordering, and “chain‑of‑thought” cues.
- Contextual Knowledge: which imported lemmas, definitions, or previously proved theorems are supplied as background.
- Control Strategies: how many candidate tactic sequences are generated, when to invoke a fallback prover, and how to prune failed attempts.
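The three levers above can be bundled into a single configuration object that any black‑box optimizer searches over. The sketch below is hypothetical (the paper does not detail RocqSmith's actual API here); names such as `AgentConfig`, `tune`, and `propose` are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical sketch: the agent's tunable levers as one config, plus a
# generic loop into which any optimizer (BO, ES, FSB, ...) can plug.
# All names are illustrative, not the released RocqSmith API.

@dataclass(frozen=True)
class AgentConfig:
    prompt_template: str             # wording, example ordering, chain-of-thought cues
    context_lemmas: Tuple[str, ...]  # background facts injected into the prompt
    num_candidates: int              # candidate tactic sequences generated per theorem

History = List[Tuple[AgentConfig, float]]

def tune(evaluate: Callable[[AgentConfig], float],
         propose: Callable[[History], AgentConfig],
         initial: AgentConfig, budget: int) -> AgentConfig:
    """Generic tuning loop: `evaluate` scores a config (e.g. proof completion
    rate on a validation set); `propose` is the optimizer's suggestion step."""
    history: History = [(initial, evaluate(initial))]
    for _ in range(budget):
        cfg = propose(history)                  # optimizer proposes a new config
        history.append((cfg, evaluate(cfg)))    # score it and record the result
    return max(history, key=lambda h: h[1])[0]  # best configuration found
```

Keeping the optimizer behind a single `propose(history)` interface is what makes the wrapper "plug‑and‑play": swapping BO for ES or FSB changes only that callable, not the agent.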
- Optimizers Tested
- Bayesian Optimization (BO) – models the performance surface and selects promising configurations.
- Evolutionary Strategies (ES) – evolves a population of configurations through mutation and selection.
- Few‑Shot Bootstrapping (FSB) – iteratively augments the prompt with successful proof snippets discovered in earlier runs (a form of self‑play).
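Of the three, FSB is simple enough to sketch in full: a loop that re‑attempts unsolved theorems while feeding proofs found in earlier rounds back into the prompt as few‑shot examples. This is a minimal illustration, not the paper's implementation; `prove` stands in for the LLM‑backed agent plus the proof checker.

```python
# Minimal sketch of few-shot bootstrapping: successful proofs from earlier
# rounds are folded back into the prompt as examples for later attempts.
# `prove(theorem, examples)` is a stand-in for the LLM agent; it returns a
# checked proof script or None on failure. Names are illustrative.

def few_shot_bootstrap(theorems, prove, rounds=3, max_examples=4):
    examples = []   # (statement, proof) pairs harvested from earlier rounds
    solved = {}
    for _ in range(rounds):
        for thm in theorems:
            if thm in solved:
                continue
            proof = prove(thm, examples[:max_examples])
            if proof is not None:             # proof checked successfully
                solved[thm] = proof
                examples.append((thm, proof)) # bootstrap the prompt
    return solved
```

The self‑play flavor comes from the feedback loop: theorems that fail with an empty prompt may succeed once easier proofs populate the example slots.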
- Evaluation Protocol – For each optimizer, the system runs on a held‑out test set of 500 theorems. Success is measured by proof completion rate (percentage of theorems fully discharged) and average tactic count (efficiency).
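The two metrics are straightforward to compute; a minimal sketch, assuming per‑theorem results of the form (completed, tactic count):

```python
# Sketch of the two evaluation metrics named above: proof completion rate
# over the held-out set, and average tactic count on completed proofs.
# The (bool, int) result format is an assumption for illustration.

def evaluate_run(results):
    """results: list of (completed, tactic_count) pairs, one per theorem."""
    completed = [n for ok, n in results if ok]
    completion_rate = len(completed) / len(results)
    avg_tactics = sum(completed) / len(completed) if completed else float("inf")
    return completion_rate, avg_tactics
```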
- Baseline Comparison – Results are compared against the hand‑tuned state‑of‑the‑art Coq proof agent from prior work (the “engineered” baseline).
Results & Findings
| Optimizer | Proof Completion ↑ (vs. unoptimized base agent) | Avg. Tactics ↓ (vs. unoptimized base agent) |
|---|---|---|
| Bayesian Optimization | +4.2 % | –6 % |
| Evolutionary Strategies | +3.8 % | –5 % |
| Few‑Shot Bootstrapping | +9.1 % | –12 % |
| Unoptimized base agent | — (reference) | — (reference) |
- Few‑shot bootstrapping consistently outperformed the more heavyweight BO and ES methods, delivering the largest lift in both success rate and efficiency.
- None of the automatic optimizers reached the absolute performance of the manually engineered proof agent, which still held a ~5 % edge in completion rate.
- The optimizer‑generated prompt templates tended to converge on concise, example‑rich prompts, confirming anecdotal best practices from the theorem‑proving community.
- Contextual knowledge selection proved more sensitive: BO and ES sometimes over‑loaded the agent with irrelevant lemmas, hurting performance, whereas FSB learned to surface only the most useful facts.
Practical Implications
- Rapid Prototyping for New Domains – Developers can now plug a generic optimizer into their Coq‑based verification pipeline and obtain a “good enough” proof agent in hours rather than weeks of manual prompt engineering.
- Continuous Improvement in CI – The RocqSmith wrapper can be integrated into continuous‑integration workflows: each time a new library is added, the optimizer automatically refines prompts and context, keeping the proof automation up‑to‑date.
- Lower Barrier for Formal Methods Adoption – Teams without deep expertise in Coq tactics can rely on the few‑shot bootstrapping loop to generate usable proof scripts, accelerating the verification of safety‑critical software, cryptographic protocols, or hardware designs.
- Template for Other Proof Assistants – Although the study focuses on Coq, the same optimizer‑as‑agent pattern can be transplanted to Lean, Isabelle, or Agda, opening a path toward universal, self‑optimizing proof assistants.
- Cost‑Effective Use of LLMs – By automatically pruning unnecessary context and limiting the number of generated tactic candidates, the approach reduces API call volume to commercial LLM providers, translating into tangible cost savings.
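The cost argument in the last bullet is roughly linear in both context size and candidate count; a back‑of‑the‑envelope sketch with made‑up token prices (none of these numbers come from the paper):

```python
# Back-of-the-envelope sketch: pruning injected context and capping the
# number of sampled tactic candidates both scale API spend roughly linearly.
# Token counts and per-token prices below are illustrative assumptions.

def attempt_cost(context_tokens, output_tokens, candidates,
                 in_price=3e-6, out_price=15e-6):
    """Approximate USD cost of one theorem attempt with n sampled candidates."""
    return candidates * (context_tokens * in_price + output_tokens * out_price)

# Bloated config: 8k tokens of lemmas, 16 candidates per theorem.
before = attempt_cost(context_tokens=8000, output_tokens=400, candidates=16)
# Pruned config: 2k tokens of relevant lemmas, 8 candidates.
after = attempt_cost(context_tokens=2000, output_tokens=400, candidates=8)
```

Under these illustrative numbers the pruned configuration costs a fifth as much per attempt, which compounds quickly over a 500‑theorem benchmark or a CI run.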
Limitations & Future Work
- Performance Gap to Hand‑Tuned Agents – Even the best automatic optimizer falls short of expert‑crafted systems, indicating that nuanced domain knowledge (e.g., bespoke tactic combinators) remains hard to capture automatically.
- Scalability of Optimization Loop – Bayesian and evolutionary methods required many evaluation runs, which can be expensive when each proof attempt involves costly LLM inference.
- Generalization Beyond Benchmarks – The test suite, while diverse, still represents a curated set of theorems; real‑world codebases with highly specialized libraries may exhibit different behavior.
- Future Directions – The authors suggest exploring meta‑learning across multiple proof assistants, hybrid approaches that combine few‑shot bootstrapping with gradient‑based fine‑tuning of LLMs, and tighter integration with Coq’s own proof‑search mechanisms to close the performance gap.
Authors
- Andrei Kozyrev
- Nikita Khramov
- Denis Lochmelis
- Valerio Morelli
- Gleb Solovev
- Anton Podkopaev
Paper Information
- arXiv ID: 2602.05762v1
- Categories: cs.AI, cs.LG, cs.LO, cs.SE
- Published: February 5, 2026