[Paper] AutoICE: Automatically Synthesizing Verifiable C Code via LLM-driven Evolution
Source: arXiv - 2512.07501v1
Overview
The paper presents AutoICE, a system that uses large language models (LLMs) together with an evolutionary search strategy to automatically generate C code that can be formally verified against natural‑language specifications. By tackling the syntactic and semantic pitfalls that have plagued earlier auto‑formalization attempts, AutoICE pushes the reliability of AI‑generated code closer to production‑grade quality.
Key Contributions
- LLM‑driven evolutionary framework: Maintains a population of diverse code “individuals” and applies collaborative crossover operators to explore a richer search space than single‑pass generation.
- Self‑reflective mutation: A feedback loop where the LLM introspects on failed verification attempts, learns implicit domain knowledge, and mutates the code accordingly.
- High verification success rates: Achieves 90.36 % verified code on the standard benchmark and 88.33 % on a developer‑friendly variant, beating the previous state‑of‑the‑art by a large margin.
- Open‑source dataset and tooling: Provides a curated collection of natural‑language requirements paired with verifiable C snippets, facilitating reproducibility and further research.
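To give a sense of what such a requirement/snippet pair looks like, the sketch below pairs a one-sentence requirement with an ACSL-annotated C function that a deductive verifier such as Frama-C/WP can prove. The requirement, function name, and contract are illustrative assumptions, not an entry from the released dataset.

```c
/* Illustrative requirement (not from the paper's dataset):
   "Return the larger of two integers x and y." */

/*@ assigns \nothing;
    ensures \result >= x && \result >= y;
    ensures \result == x || \result == y;
*/
int max2(int x, int y)
{
    return (x >= y) ? x : y;
}
```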
Methodology
- Population Initialization – The system seeds an initial pool of candidate C programs using an LLM prompted with the natural‑language requirement. Each candidate is deliberately varied (different idioms, data‑structure choices, etc.) to avoid early convergence.
- Collaborative Crossover – Pairs of candidates exchange code fragments (e.g., function bodies, loop constructs) under the guidance of the LLM, which keeps the merged program syntactically correct. This mimics genetic recombination and injects combinations that a single LLM pass is unlikely to produce (a simplified illustration follows this list).
- Verification Loop – Each offspring is fed to a C verification tool (e.g., Frama‑C, VeriFast). The tool returns either a proof of correctness or a counterexample/error trace.
- Self‑Reflective Mutation – When verification fails, the LLM analyzes the error trace, infers the missing implicit knowledge (e.g., needed preconditions or loop invariants), and mutates the code accordingly. The mutated program re‑enters the population for the next generation (see the annotated sketch at the end of this section).
- Selection & Termination – Programs that verify successfully are promoted; the process repeats until a verified candidate is found or the resource budget is exhausted.
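As a simplified illustration of the crossover step (the requirement, candidate defects, and exchanged fragments below are invented for exposition, not taken from the paper), two candidates for “sum the first n elements of a” can each contribute a fragment to an offspring that neither parent matches on its own:

```c
/* Candidate A: for-loop idiom, but with an off-by-one bound. */
int sum_a(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i <= n; i++)   /* defect: reads a[n] out of bounds */
        s += a[i];
    return s;
}

/* Candidate B: correct bound, expressed as a while loop. */
int sum_b(const int *a, int n)
{
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* Offspring: candidate A's for-loop structure spliced with candidate B's
   loop bound. The LLM guides the splice so the result still compiles, and
   the offspring then enters the verification loop like any other individual. */
int sum_offspring(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```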
The whole pipeline is automated, requiring only the natural‑language specification as input.
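To make the verification and mutation steps concrete, the sketch below assumes Frama-C/WP as the verifier and an invented array-maximum requirement; the contract, invariants, and failure scenario are illustrative, not reproduced from the paper.

```c
/* Illustrative requirement: "Return the maximum element of a[0..n-1] (n > 0)." */

/*@ requires n > 0;
    requires \valid_read(a + (0 .. n - 1));
    assigns \nothing;
    ensures \forall integer k; 0 <= k < n ==> a[k] <= \result;
    ensures \exists integer k; 0 <= k < n && a[k] == \result;
*/
int array_max(const int *a, int n)
{
    int max = a[0];
    int i;
    /* An initial candidate without the loop annotations below would leave the
       postconditions unproven; a self-reflective mutation would add the
       inferred invariants and variant, after which the proof is discharged. */
    /*@ loop invariant 1 <= i <= n;
        loop invariant \forall integer k; 0 <= k < i ==> a[k] <= max;
        loop invariant \exists integer k; 0 <= k < i && a[k] == max;
        loop assigns i, max;
        loop variant n - i;
    */
    for (i = 1; i < n; i++) {
        if (a[i] > max)
            max = a[i];
    }
    return max;
}
```

On the annotated version, a command along the lines of `frama-c -wp file.c` would attempt to discharge the resulting proof obligations; a failed obligation or error trace is what feeds the self-reflective mutation step.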
Results & Findings
| Benchmark | Verification Success (AutoICE) | Prior SOTA | Improvement |
|---|---|---|---|
| Standard dataset | 90.36 % | ~78 % | +12.36 pp |
| Developer‑friendly variant | 88.33 % | 65 % | +23.33 pp |
- Error reduction: The crossover step cuts down syntax‑error propagation by ~45 % compared with a naïve single‑LLM generation loop.
- Implicit knowledge capture: Self‑reflective mutation successfully added missing invariants in 82 % of the cases where the initial code failed verification.
- Runtime: Average synthesis time per requirement stayed under 30 seconds on a single RTX 4090 GPU, making the approach practical for interactive developer tools.
Practical Implications
- Developer assistants: IDE plugins could invoke AutoICE to turn a comment or user story into a verified C function, dramatically reducing the manual effort of writing correct low‑level code.
- Safety‑critical systems: Industries such as automotive, aerospace, and medical devices can leverage AutoICE to generate code that meets formal safety standards (e.g., ISO 26262) without requiring in‑house formal‑methods experts.
- Rapid prototyping: Teams can prototype algorithmic components in C, get immediate verification feedback, and iterate faster than with traditional test‑driven development.
- Education: AutoICE can serve as a teaching aid, showing students how formal specifications translate into concrete, provably correct implementations.
Limitations & Future Work
- Domain scope: The current evaluation focuses on algorithmic kernels and does not cover heavy I/O, concurrency, or OS‑level interactions, where verification is more challenging.
- LLM dependence: Quality hinges on the underlying LLM’s training data; requirements involving rare or domain‑specific APIs may still yield incorrect snippets.
- Scalability of verification: For large codebases, the verification step can become a bottleneck; integrating incremental or modular verification techniques is an open direction.
- Extension to other languages: The authors plan to adapt the evolutionary framework to Rust, Go, and other memory‑safe languages where formal verification tools are maturing.
AutoICE demonstrates that coupling LLMs with evolutionary search and formal verification can bridge the gap between AI‑generated code and production‑ready, provably correct software—a promising step toward more trustworthy developer tools.
Authors
- Weilin Luo
- Xueyi Liang
- Haotian Deng
- Yanan Liu
- Hai Wan
Paper Information
- arXiv ID: 2512.07501v1
- Categories: cs.SE, cs.AI
- Published: December 8, 2025