[Paper] Teaching LLMs Program Semantics via Symbolic Execution Traces

Published: 4 days ago (May 7, 2026 at 09:01 AM EDT)

5 min read

Source: arXiv

Source: arXiv - 2605.06184v1

Overview

A new study shows that large language models (LLMs) can be taught to understand the semantics of C programs by exposing them to symbolic‑execution traces—detailed, step‑by‑step representations of how a program would behave. By fine‑tuning a modest 8‑billion‑parameter model (Qwen‑3‑8B) on just a few thousand of these traces, the authors dramatically improve the model’s ability to spot bugs such as memory‑safety violations, overflows, and data races, closing a gap that standard LLMs leave wide open.

Key Contributions

A 500‑task benchmark covering five verification properties (memory safety, overflow, termination, reachability, data races) built on the SV‑COMP 2025 suite.
Comprehensive evaluation of 14 LLMs from six families, revealing that most models excel at confirming correct behavior but struggle to detect violations, especially in longer programs.
Trace‑based pre‑training: using ~3 k symbolic‑execution bug traces from the Soteria engine to continue pre‑training Qwen‑3‑8B.
Chain‑of‑thought (CoT) prompting at inference time, combined with trace‑trained models, yields a +17 pp boost in violation‑detection accuracy.
Super‑additive effect: trace training and CoT together outperform either technique alone, even when the training traces target only a subset of the five properties.
Efficiency win: the 8 B trace‑trained model outperforms a 32 B baseline that receives no trace training, and narrows the gap to the larger model when CoT is applied.
Ablation study (28 configurations) confirming that gains stem from the semantic content of traces, not just extra code exposure, and that trace format/curation matters.

Methodology

Benchmark construction – The authors curated 500 C programs from the SV‑COMP 2025 competition, labeling each with one of five verification properties and whether the property holds (safe) or is violated (bug).
Baseline evaluation – 14 off‑the‑shelf LLMs (including LLaMA, GPT‑NeoX, and Qwen families) were prompted to answer “holds/violates” for each task, both with and without chain‑of‑thought reasoning.
Trace generation – Using the open‑source Soteria symbolic execution engine, they ran each buggy program and recorded the execution trace that leads to the violation (variable states, path constraints, etc.). About 3 k distinct bug traces were collected.
Continued pre‑training – The Qwen‑3‑8B model was further trained on a mixed dataset: the original code snippets + the symbolic traces, formatted as <code> → <trace>.
Inference with CoT – At test time, prompts were augmented with a “think step” instruction, encouraging the model to reason through the trace before answering.
Ablations – Variations included (a) training on raw code only, (b) training on traces without CoT, (c) different trace granularities, and (d) scaling the number of traces.

Results & Findings

Model (size)	Overall Accuracy	Violation‑Detection Δ (vs. baseline)
Qwen‑3‑8B (baseline)	71 %	–
Qwen‑3‑8B + trace pre‑train + CoT	88 %	+17 pp
Qwen‑3‑32B (baseline, no trace)	84 %	–
Qwen‑3‑32B + CoT	86 %	+2 pp
Qwen‑3‑8B + trace pre‑train (no CoT)	73 %	+2 pp

Balanced performance: The trace‑trained 8 B model achieves a near‑symmetric confusion matrix—detecting bugs as reliably as confirming safety.
Length sensitivity: Violation detection on programs >200 LOC drops >20 pp for vanilla models; the trace‑trained model’s drop is <5 pp.
Cross‑property transfer: Even though training traces only covered memory‑safety bugs, detection improvements were observed for overflow, termination, and data‑race tasks.
Format matters: Traces that include explicit constraint‑solving steps (e.g., “assert x ≤ MAX”) yield higher gains than raw symbolic states alone.

Practical Implications

Developer tooling: Embedding a trace‑aware LLM into IDEs could provide semantic bug hints that go beyond pattern‑matching linters, especially for hard‑to‑detect issues like data races.
Automated code review: Companies can fine‑tune a modest‑size model on a curated set of execution traces from their own codebase, achieving verification‑level insight without the overhead of full symbolic execution for every PR.
Cost‑effective verification: The study demonstrates that a small, inexpensive model (8 B) can rival much larger models when enriched with semantic traces, lowering compute budgets for continuous integration pipelines.
Education & onboarding: Interactive “think‑aloud” explanations generated from traces can help junior developers understand why a piece of code is unsafe, bridging the gap between static analysis reports and human intuition.

Limitations & Future Work

Trace diversity: The training set contains only ~3 k bug traces from open‑source projects; rare or domain‑specific bugs may still be missed.
Scalability to larger codebases: While trace training mitigates length‑related degradation, the approach has not been tested on programs exceeding several thousand lines or on multi‑module projects.
Property coverage: Only five verification properties were examined; extending to security‑focused properties (e.g., taint analysis) remains open.
Model generality: Experiments focused on Qwen; replicating the gains on other architectures (e.g., LLaMA, Mistral) needs validation.
Human‑in‑the‑loop: Future work could explore how developers can supply custom traces for proprietary code, creating a feedback loop that continuously improves the model’s semantic understanding.

Authors

Jonas Bayer
Stefan Zetzsche
Olivier Bouissou
Remi Delmas
Michael Tautschnig
Soonho Kong

Paper Information

arXiv ID: 2605.06184v1
Categories: cs.SE, cs.LG, cs.PL
Published: May 7, 2026
PDF: Download PDF

[Paper] Teaching LLMs Program Semantics via Symbolic Execution Traces

Overview

Key Contributions

Methodology

Results & Findings

Practical Implications

Limitations & Future Work

Authors

Paper Information

Related posts

[Paper] Normalizing Trajectory Models

[Paper] Zero-Shot Imagined Speech Decoding via Imagined-to-Listened MEG Mapping

[Paper] GRAPHLCP: Structure-Aware Localized Conformal Prediction on Graphs

[Paper] EmambaIR: Efficient Visual State Space Model for Event-guided Image Reconstruction