[Paper] AutoICE: Automatically Synthesizing Verifiable C Code via LLM-driven Evolution
Source: arXiv - 2512.07501v1
Overview
The paper presents AutoICE, a system that uses large language models (LLMs) together with an evolutionary search strategy to automatically generate C code that can be formally verified against natural‑language specifications. By tackling the syntactic and semantic pitfalls that have plagued earlier auto‑formalization attempts, AutoICE pushes the reliability of AI‑generated code closer to production‑grade quality.
Key Contributions
- LLM‑driven evolutionary framework: Maintains a population of diverse code “individuals” and applies collaborative crossover operators to explore a richer search space than single‑pass generation.
- Self‑reflective mutation: A feedback loop where the LLM introspects on failed verification attempts, learns implicit domain knowledge, and mutates the code accordingly.
- High verification success rates: Achieves 90.36 % verified code on the standard benchmark and 88.33 % on a developer‑friendly variant, beating the previous state‑of‑the‑art by a large margin.
- Open‑source dataset and tooling: Provides a curated collection of natural‑language requirements paired with verifiable C snippets, facilitating reproducibility and further research.
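To give a sense of what such a requirement/snippet pair looks like, the sketch below pairs a one-sentence requirement with an ACSL-annotated C function that a deductive verifier such as Frama-C/WP can prove. The requirement, function name, and contract are illustrative assumptions, not an entry from the released dataset.

```c
/* Illustrative requirement (not from the paper's dataset):
   "Return the larger of two integers x and y." */

/*@ assigns \nothing;
    ensures \result >= x && \result >= y;
    ensures \result == x || \result == y;
*/
int max2(int x, int y)
{
    return (x >= y) ? x : y;
}
```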
Methodology
- Population Initialization – The system seeds an initial pool of candidate C programs using an LLM prompted with the natural‑language requirement. Each candidate is deliberately varied (different idioms, data‑structure choices, etc.) to avoid early convergence.
- Collaborative Crossover – Pairs of candidates exchange code fragments (e.g., function bodies, loop constructs) under the guidance of the LLM, which keeps the merged program syntactically correct. This mimics genetic recombination and injects combinations that a single LLM pass is unlikely to produce (a simplified illustration follows this list).
- Verification Loop – Each offspring is fed to a C verification tool (e.g., Frama‑C, VeriFast). The tool returns either a proof of correctness or a counterexample/error trace.
- Self‑Reflective Mutation – When verification fails, the LLM analyzes the error trace, infers the missing implicit knowledge (e.g., needed preconditions or loop invariants), and mutates the code accordingly. The mutated program re‑enters the population for the next generation (see the annotated sketch at the end of this section).
- Selection & Termination – Programs that verify successfully are promoted; the process repeats until a verified candidate is found or the resource budget is exhausted.
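As a simplified illustration of the crossover step (the requirement, candidate defects, and exchanged fragments below are invented for exposition, not taken from the paper), two candidates for “sum the first n elements of a” can each contribute a fragment to an offspring that neither parent matches on its own:

```c
/* Candidate A: for-loop idiom, but with an off-by-one bound. */
int sum_a(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i <= n; i++)   /* defect: reads a[n] out of bounds */
        s += a[i];
    return s;
}

/* Candidate B: correct bound, expressed as a while loop. */
int sum_b(const int *a, int n)
{
    int s = 0;
    int i = 0;
    while (i < n) {
        s += a[i];
        i++;
    }
    return s;
}

/* Offspring: candidate A's for-loop structure spliced with candidate B's
   loop bound. The LLM guides the splice so the result still compiles, and
   the offspring then enters the verification loop like any other individual. */
int sum_offspring(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];
    return s;
}
```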
The whole pipeline is automated, requiring only the natural‑language specification as input.
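To make the verification and mutation steps concrete, the sketch below assumes Frama-C/WP as the verifier and an invented array-maximum requirement; the contract, invariants, and failure scenario are illustrative, not reproduced from the paper.

```c
/* Illustrative requirement: "Return the maximum element of a[0..n-1] (n > 0)." */

/*@ requires n > 0;
    requires \valid_read(a + (0 .. n - 1));
    assigns \nothing;
    ensures \forall integer k; 0 <= k < n ==> a[k] <= \result;
    ensures \exists integer k; 0 <= k < n && a[k] == \result;
*/
int array_max(const int *a, int n)
{
    int max = a[0];
    int i;
    /* An initial candidate without the loop annotations below would leave the
       postconditions unproven; a self-reflective mutation would add the
       inferred invariants and variant, after which the proof is discharged. */
    /*@ loop invariant 1 <= i <= n;
        loop invariant \forall integer k; 0 <= k < i ==> a[k] <= max;
        loop invariant \exists integer k; 0 <= k < i && a[k] == max;
        loop assigns i, max;
        loop variant n - i;
    */
    for (i = 1; i < n; i++) {
        if (a[i] > max)
            max = a[i];
    }
    return max;
}
```

On the annotated version, a command along the lines of `frama-c -wp file.c` would attempt to discharge the resulting proof obligations; a failed obligation or error trace is what feeds the self-reflective mutation step.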
Results & Findings
| Benchmark | Verification Success (AutoICE) | Prior SOTA | Improvement |
|---|---|---|---|
| Standard dataset | 90.36 % | ~78 % | +12.36 pp |
| Developer‑friendly variant | 88.33 % | 65 % | +23.33 pp |
- Error reduction: The crossover step cuts down syntax‑error propagation by ~45 % compared with a naïve single‑LLM generation loop.
- Implicit knowledge capture: Self‑reflective mutation successfully added missing invariants in 82 % of the cases where the initial code failed verification.
- Runtime: Average synthesis time per requirement stayed under 30 seconds on a single RTX 4090 GPU, making the approach practical for interactive developer tools.
Practical Implications
- Developer assistants: IDE plugins could invoke AutoICE to turn a comment or user story into a verified C function, dramatically reducing the manual effort of writing correct low‑level code.
- Safety‑critical systems: Industries such as automotive, aerospace, and medical devices can leverage AutoICE to generate code that meets formal safety standards (e.g., ISO 26262) without requiring in‑house formal‑methods experts.
- Rapid prototyping: Teams can prototype algorithmic components in C, get immediate verification feedback, and iterate faster than with traditional test‑driven development.
- Education: AutoICE can serve as a teaching aid, showing students how formal specifications translate into concrete, provably correct implementations.
Limitations & Future Work
- Domain scope: The current evaluation focuses on algorithmic kernels and does not cover heavy I/O, concurrency, or OS‑level interactions, where verification is more challenging.
- LLM dependence: Quality hinges on the underlying LLM’s training data; requirements involving rare or domain‑specific APIs may still yield incorrect snippets.
- Scalability of verification: For large codebases, the verification step can become a bottleneck; integrating incremental or modular verification techniques is an open direction.
- Extension to other languages: The authors plan to adapt the evolutionary framework to Rust, Go, and other memory‑safe languages where formal verification tools are maturing.
AutoICE demonstrates that coupling LLMs with evolutionary search and formal verification can bridge the gap between AI‑generated code and production‑ready, provably correct software—a promising step toward more trustworthy developer tools.
Authors
- Weilin Luo
- Xueyi Liang
- Haotian Deng
- Yanan Liu
- Hai Wan
Paper Information
- arXiv ID: 2512.07501v1
- Categories: cs.SE, cs.AI
- Published: December 8, 2025