[Paper] Teaching LLMs Program Semantics via Symbolic Execution Traces

Published: (May 7, 2026 at 09:01 AM EDT)
5 min read
Source: arXiv

Source: arXiv - 2605.06184v1

Overview

A new study shows that large language models (LLMs) can be taught to understand the semantics of C programs by exposing them to symbolic‑execution traces—detailed, step‑by‑step representations of how a program would behave. By fine‑tuning a modest 8‑billion‑parameter model (Qwen‑3‑8B) on just a few thousand of these traces, the authors dramatically improve the model’s ability to spot bugs such as memory‑safety violations, overflows, and data races, closing a gap that standard LLMs leave wide open.

Key Contributions

  • A 500‑task benchmark covering five verification properties (memory safety, overflow, termination, reachability, data races) built on the SV‑COMP 2025 suite.
  • Comprehensive evaluation of 14 LLMs from six families, revealing that most models excel at confirming correct behavior but struggle to detect violations, especially in longer programs.
  • Trace‑based pre‑training: using ~3 k symbolic‑execution bug traces from the Soteria engine to continue pre‑training Qwen‑3‑8B.
  • Chain‑of‑thought (CoT) prompting at inference time, combined with trace‑trained models, yields a +17 pp boost in violation‑detection accuracy.
  • Super‑additive effect: trace training and CoT together outperform either technique alone, even when the training traces target only a subset of the five properties.
  • Efficiency win: the 8 B trace‑trained model outperforms a 32 B baseline that receives no trace training, and narrows the gap to the larger model when CoT is applied.
  • Ablation study (28 configurations) confirming that gains stem from the semantic content of traces, not just extra code exposure, and that trace format/curation matters.

Methodology

  1. Benchmark construction – The authors curated 500 C programs from the SV‑COMP 2025 competition, labeling each with one of five verification properties and whether the property holds (safe) or is violated (bug).
  2. Baseline evaluation – 14 off‑the‑shelf LLMs (including LLaMA, GPT‑NeoX, and Qwen families) were prompted to answer “holds/violates” for each task, both with and without chain‑of‑thought reasoning.
  3. Trace generation – Using the open‑source Soteria symbolic execution engine, they ran each buggy program and recorded the execution trace that leads to the violation (variable states, path constraints, etc.). About 3 k distinct bug traces were collected.
  4. Continued pre‑training – The Qwen‑3‑8B model was further trained on a mixed dataset: the original code snippets + the symbolic traces, formatted as <code> → <trace>.
  5. Inference with CoT – At test time, prompts were augmented with a “think step” instruction, encouraging the model to reason through the trace before answering.
  6. Ablations – Variations included (a) training on raw code only, (b) training on traces without CoT, (c) different trace granularities, and (d) scaling the number of traces.

Results & Findings

Model (size)Overall AccuracyViolation‑Detection Δ (vs. baseline)
Qwen‑3‑8B (baseline)71 %
Qwen‑3‑8B + trace pre‑train + CoT88 %+17 pp
Qwen‑3‑32B (baseline, no trace)84 %
Qwen‑3‑32B + CoT86 %+2 pp
Qwen‑3‑8B + trace pre‑train (no CoT)73 %+2 pp
  • Balanced performance: The trace‑trained 8 B model achieves a near‑symmetric confusion matrix—detecting bugs as reliably as confirming safety.
  • Length sensitivity: Violation detection on programs >200 LOC drops >20 pp for vanilla models; the trace‑trained model’s drop is <5 pp.
  • Cross‑property transfer: Even though training traces only covered memory‑safety bugs, detection improvements were observed for overflow, termination, and data‑race tasks.
  • Format matters: Traces that include explicit constraint‑solving steps (e.g., “assert x ≤ MAX”) yield higher gains than raw symbolic states alone.

Practical Implications

  • Developer tooling: Embedding a trace‑aware LLM into IDEs could provide semantic bug hints that go beyond pattern‑matching linters, especially for hard‑to‑detect issues like data races.
  • Automated code review: Companies can fine‑tune a modest‑size model on a curated set of execution traces from their own codebase, achieving verification‑level insight without the overhead of full symbolic execution for every PR.
  • Cost‑effective verification: The study demonstrates that a small, inexpensive model (8 B) can rival much larger models when enriched with semantic traces, lowering compute budgets for continuous integration pipelines.
  • Education & onboarding: Interactive “think‑aloud” explanations generated from traces can help junior developers understand why a piece of code is unsafe, bridging the gap between static analysis reports and human intuition.

Limitations & Future Work

  • Trace diversity: The training set contains only ~3 k bug traces from open‑source projects; rare or domain‑specific bugs may still be missed.
  • Scalability to larger codebases: While trace training mitigates length‑related degradation, the approach has not been tested on programs exceeding several thousand lines or on multi‑module projects.
  • Property coverage: Only five verification properties were examined; extending to security‑focused properties (e.g., taint analysis) remains open.
  • Model generality: Experiments focused on Qwen; replicating the gains on other architectures (e.g., LLaMA, Mistral) needs validation.
  • Human‑in‑the‑loop: Future work could explore how developers can supply custom traces for proprietary code, creating a feedback loop that continuously improves the model’s semantic understanding.

Authors

  • Jonas Bayer
  • Stefan Zetzsche
  • Olivier Bouissou
  • Remi Delmas
  • Michael Tautschnig
  • Soonho Kong

Paper Information

  • arXiv ID: 2605.06184v1
  • Categories: cs.SE, cs.LG, cs.PL
  • Published: May 7, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »

[Paper] Normalizing Trajectory Models

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coar...