[Paper] Teaching LLMs Program Semantics via Symbolic Execution Traces
Source: arXiv - 2605.06184v1
Overview
A new study shows that large language models (LLMs) can be taught to understand the semantics of C programs by exposing them to symbolic‑execution traces—detailed, step‑by‑step representations of how a program would behave. By fine‑tuning a modest 8‑billion‑parameter model (Qwen‑3‑8B) on just a few thousand of these traces, the authors dramatically improve the model’s ability to spot bugs such as memory‑safety violations, overflows, and data races, closing a gap that standard LLMs leave wide open.
Key Contributions
- A 500‑task benchmark covering five verification properties (memory safety, overflow, termination, reachability, data races) built on the SV‑COMP 2025 suite.
- Comprehensive evaluation of 14 LLMs from six families, revealing that most models excel at confirming correct behavior but struggle to detect violations, especially in longer programs.
- Trace‑based pre‑training: using ~3 k symbolic‑execution bug traces from the Soteria engine to continue pre‑training Qwen‑3‑8B.
- Chain‑of‑thought (CoT) prompting at inference time, combined with trace‑trained models, yields a +17 pp boost in violation‑detection accuracy.
- Super‑additive effect: trace training and CoT together outperform either technique alone, even when the training traces target only a subset of the five properties.
- Efficiency win: the 8 B trace‑trained model outperforms a 32 B baseline that receives no trace training, and narrows the gap to the larger model when CoT is applied.
- Ablation study (28 configurations) confirming that gains stem from the semantic content of traces, not just extra code exposure, and that trace format/curation matters.
Methodology
- Benchmark construction – The authors curated 500 C programs from the SV‑COMP 2025 competition, labeling each with one of five verification properties and whether the property holds (safe) or is violated (bug).
- Baseline evaluation – 14 off‑the‑shelf LLMs (including LLaMA, GPT‑NeoX, and Qwen families) were prompted to answer “holds/violates” for each task, both with and without chain‑of‑thought reasoning.
- Trace generation – Using the open‑source Soteria symbolic execution engine, they ran each buggy program and recorded the execution trace that leads to the violation (variable states, path constraints, etc.). About 3 k distinct bug traces were collected.
- Continued pre‑training – The Qwen‑3‑8B model was further trained on a mixed dataset: the original code snippets + the symbolic traces, formatted as
<code> → <trace>. - Inference with CoT – At test time, prompts were augmented with a “think step” instruction, encouraging the model to reason through the trace before answering.
- Ablations – Variations included (a) training on raw code only, (b) training on traces without CoT, (c) different trace granularities, and (d) scaling the number of traces.
Results & Findings
| Model (size) | Overall Accuracy | Violation‑Detection Δ (vs. baseline) |
|---|---|---|
| Qwen‑3‑8B (baseline) | 71 % | – |
| Qwen‑3‑8B + trace pre‑train + CoT | 88 % | +17 pp |
| Qwen‑3‑32B (baseline, no trace) | 84 % | – |
| Qwen‑3‑32B + CoT | 86 % | +2 pp |
| Qwen‑3‑8B + trace pre‑train (no CoT) | 73 % | +2 pp |
- Balanced performance: The trace‑trained 8 B model achieves a near‑symmetric confusion matrix—detecting bugs as reliably as confirming safety.
- Length sensitivity: Violation detection on programs >200 LOC drops >20 pp for vanilla models; the trace‑trained model’s drop is <5 pp.
- Cross‑property transfer: Even though training traces only covered memory‑safety bugs, detection improvements were observed for overflow, termination, and data‑race tasks.
- Format matters: Traces that include explicit constraint‑solving steps (e.g., “assert x ≤ MAX”) yield higher gains than raw symbolic states alone.
Practical Implications
- Developer tooling: Embedding a trace‑aware LLM into IDEs could provide semantic bug hints that go beyond pattern‑matching linters, especially for hard‑to‑detect issues like data races.
- Automated code review: Companies can fine‑tune a modest‑size model on a curated set of execution traces from their own codebase, achieving verification‑level insight without the overhead of full symbolic execution for every PR.
- Cost‑effective verification: The study demonstrates that a small, inexpensive model (8 B) can rival much larger models when enriched with semantic traces, lowering compute budgets for continuous integration pipelines.
- Education & onboarding: Interactive “think‑aloud” explanations generated from traces can help junior developers understand why a piece of code is unsafe, bridging the gap between static analysis reports and human intuition.
Limitations & Future Work
- Trace diversity: The training set contains only ~3 k bug traces from open‑source projects; rare or domain‑specific bugs may still be missed.
- Scalability to larger codebases: While trace training mitigates length‑related degradation, the approach has not been tested on programs exceeding several thousand lines or on multi‑module projects.
- Property coverage: Only five verification properties were examined; extending to security‑focused properties (e.g., taint analysis) remains open.
- Model generality: Experiments focused on Qwen; replicating the gains on other architectures (e.g., LLaMA, Mistral) needs validation.
- Human‑in‑the‑loop: Future work could explore how developers can supply custom traces for proprietary code, creating a feedback loop that continuously improves the model’s semantic understanding.
Authors
- Jonas Bayer
- Stefan Zetzsche
- Olivier Bouissou
- Remi Delmas
- Michael Tautschnig
- Soonho Kong
Paper Information
- arXiv ID: 2605.06184v1
- Categories: cs.SE, cs.LG, cs.PL
- Published: May 7, 2026
- PDF: Download PDF