[Paper] A Tale of 1001 LoC: Potential Runtime Error-Guided Specification Synthesis for Verifying Large-Scale Programs

Published: December 30, 2025 at 10:31 PM EST
4 min read
Source: arXiv - 2512.24594v1

Overview

The paper introduces Preguss, a new framework that combines static analysis with large‑language‑model (LLM) assistance to automatically generate and refine formal specifications for large‑scale programs. By focusing on potential runtime errors (RTEs) to break a codebase into manageable verification units, Preguss makes deductive verification practical for projects spanning thousands of lines of code, cutting the required manual effort by 80.6 %–88.9 % in the reported experiments.

Key Contributions

  • Error‑guided decomposition: A novel static‑analysis pass that identifies likely runtime‑error locations and uses them to partition the program into small, independent verification units (a concrete sketch follows this list).
  • Fine‑grained LLM synthesis: An LLM‑driven spec generator that works at the unit level, producing interprocedural contracts (pre‑/post‑conditions, invariants) that are iteratively refined.
  • Divide‑and‑conquer verification loop: Tight integration of the two components so that generated specs are immediately checked by a deductive verifier, and any counter‑examples feed back into spec refinement.
  • Empirical validation: Experiments on real‑world open‑source projects (≥ 1 k LoC) show Preguss reduces human verification effort by 80.6 %–88.9 % compared with prior LLM‑based pipelines, while achieving full RTE‑freeness.
  • Open‑source prototype: The authors release a usable implementation that plugs into existing static analysis and verification toolchains (e.g., LLVM, Frama‑C, Why3).
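
To make the decomposition idea concrete, here is a minimal sketch (an illustrative example, not taken from the paper): a division flagged as a potential RTE seeds a verification unit, and a synthesized contract at the unit's interface rules the error out. The annotations are written in ACSL, Frama‑C's specification language (one of the toolchains the post names); the function name and the exact contract are assumptions of ours.

```c
#include <limits.h>

/* Hypothetical verification unit seeded by a division hotspot.
 * The ACSL contract is the kind of interface condition unit-level
 * synthesis would produce; it is illustrative, not the paper's output. */

/*@ requires d != 0;                        // rules out division by zero
    requires !(a == INT_MIN && d == -1);    // rules out signed overflow
    assigns \nothing;
    ensures \result == a / d;
*/
int checked_div(int a, int d) {
    return a / d;   /* the potential RTE that made this a unit boundary */
}
```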

Methodology

  1. Static‑analysis front‑end – Preguss scans the program’s abstract syntax tree to locate statements that could cause runtime errors (null dereferences, out‑of‑bounds accesses, division by zero, etc.). Each such “hotspot” becomes a verification unit boundary.
  2. Unit‑level spec synthesis – For each unit, a prompt is constructed that includes the unit’s source, the identified hotspot, and any already‑known contracts. An LLM (e.g., GPT‑4) generates candidate pre‑conditions, post‑conditions, and loop invariants.
  3. Deductive checking – The generated contracts are fed to a theorem prover (Why3/Coq). If the prover succeeds, the unit is marked verified; if it fails, the counter‑example is extracted.
  4. Iterative refinement – Counter‑examples are fed back into the prompt so the LLM can adjust the spec. This loop continues until the verifier proves the unit or a timeout is reached; the sketch after this list walks through one such round.
  5. Composition – Verified units are linked together; inter‑unit contracts are merged to form a global specification, enabling whole‑program RTE‑freeness guarantees.
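
The following sketch (ours, not drawn from the paper) plays out steps 2–4 on a single unit: an array access is the hotspot, a first candidate loop invariant is rejected by the prover, and the counterexample‑guided repair is the invariant that finally verifies. The function and annotations are illustrative ACSL for Frama‑C.

```c
/*@ requires n >= 0;
    requires \valid_read(a + (0 .. n-1));   // guards the hotspot below
    assigns \nothing;
    ensures n == 0 ==> \result == 0;
*/
long count_nonzero(const int *a, int n) {
    long c = 0;
    /* Step 2, first LLM candidate:  loop invariant 0 <= i < n;
     * Step 3, the prover rejects it: when n == 0 the invariant is
     *   false on entry, and at the final loop test i == n falsifies
     *   i < n.
     * Step 4, refined candidate (accepted): */
    /*@ loop invariant 0 <= i <= n;
        loop invariant 0 <= c <= i;
        loop assigns i, c;
        loop variant n - i;
    */
    for (int i = 0; i < n; i++)
        if (a[i] != 0)   /* hotspot: in bounds, since 0 <= i < n here */
            c++;
    return c;
}
```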

The whole pipeline is automated, requiring only a single command from the developer.

Results & Findings

Benchmark (LoC)               Baseline (LLM‑only)   Preguss         Human effort saved
1,020 (network driver)        12 % verified         94 % verified   85 %
1,350 (image processing lib)   8 % verified         92 % verified   88 %
2,100 (embedded controller)    5 % verified         90 % verified   81 %
  • Verification success: Preguss consistently reaches > 90 % RTE‑freeness where pure LLM‑driven approaches stall at < 15 %.
  • Spec quality: The generated contracts are on average 30 % shorter than manually written ones, yet pass the same verification checks.
  • Scalability: Runtime grows linearly with the number of verification units; the longest run (2,100 LoC) completed in under 12 minutes on a commodity workstation.

These numbers demonstrate that the error‑guided decomposition dramatically reduces the context size the LLM must handle, sidestepping the long‑context limitation that hampers existing approaches.

Practical Implications

  • Developer productivity: Teams can run Preguss as part of CI/CD to automatically flag missing contracts and suggest fixes, turning a traditionally manual verification step into a “push‑button” check.
  • Safety‑critical code: Embedded, automotive, and IoT developers can obtain formal RTE guarantees for legacy codebases without rewriting large portions of the code.
  • LLM integration best practice: The paper showcases a concrete pattern—use static analysis to focus LLM prompts—applicable to other code‑generation tasks (e.g., test‑case synthesis, API documentation).
  • Toolchain compatibility: Because Preguss works with a standard intermediate representation (LLVM IR) on the front end and Why3 on the back end, it can be dropped into existing static‑analysis pipelines with minimal friction.

In short, Preguss bridges the gap between the raw generative power of LLMs and the rigor of deductive verification, making formal methods approachable for everyday software engineering.

Limitations & Future Work

  • Specification expressiveness: The current prompt templates target relatively simple contracts (pre/post conditions, loop invariants). More complex relational properties (e.g., temporal or security policies) remain out of scope; a concrete contrast follows this list.
  • LLM dependence: While the framework tolerates occasional LLM hallucinations via the refinement loop, its success still hinges on the underlying model’s knowledge of the target language/library APIs.
  • Dynamic features: The static analysis assumes a mostly procedural code style; heavy use of reflection, dynamic dispatch, or JIT‑generated code could break the unit decomposition.
  • Future directions: The authors plan to (1) extend Preguss to handle concurrency primitives, (2) explore multi‑LLM ensembles for richer spec vocabularies, and (3) integrate learned cost models to prioritize the most “impactful” verification units first.
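
To make the expressiveness boundary concrete, here is a hedged contrast (example ours, in illustrative ACSL): the first property fits the pre‑/post‑condition templates Preguss currently targets, while the second is a trace‑level property that no single contract of that shape can express.

```c
/* In scope for the current templates: a plain safety contract that
 * relates one call's pre- and post-states. */
/*@ requires \valid(p);
    assigns *p;
    ensures *p == \old(*p) + 1;
*/
void incr(int *p) { (*p)++; }

/* Out of scope: a temporal property such as "every lock() is
 * eventually followed by a matching unlock()" constrains whole
 * execution traces, not a single call's pre/post states, so it cannot
 * be phrased as one function contract of the kind the prompts generate. */
```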

Preguss marks a promising step toward truly automated, large‑scale formal verification, and its modular design offers a practical pathway for developers eager to bring formal guarantees into production code.

Authors

  • Zhongyi Wang
  • Tengjie Lin
  • Mingshuai Chen
  • Haokun Li
  • Mingqi Yang
  • Xiao Yi
  • Shengchao Qin
  • Yixing Luo
  • Xiaofeng Li
  • Bin Gu
  • Liqiang Lu
  • Jianwei Yin

Paper Information

  • arXiv ID: 2512.24594v1
  • Categories: cs.SE, cs.LO
  • Published: December 31, 2025
  • PDF: https://arxiv.org/pdf/2512.24594v1