[Paper] Can LLMs Recover Program Semantics? A Systematic Evaluation with Symbolic Execution

Published: November 24, 2025 at 08:55 AM EST
4 min read
Source: arXiv - 2511.19130v1

Overview

The paper investigates whether large language models (LLMs) can “undo” code obfuscation, a common hurdle for developers trying to understand, test, or secure software. By fine‑tuning LLMs on data generated with symbolic execution (the KLEE engine), the authors show that the models recover the original program semantics more reliably than with plain fine‑tuning alone.

Key Contributions

  • Benchmark suite for deobfuscation – four classic obfuscation techniques (control‑flow flattening, opaque predicates, arithmetic encoding, branch encoding) applied to a diverse set of C programs from the TUM Obfuscation Benchmarks, the LLVM test suite, and algorithmic repositories (a minimal opaque‑predicate example follows this list).
  • Hybrid training pipeline – baseline fine‑tuning on obfuscated/original code pairs plus an “enhanced” mode that injects KLEE‑generated artifacts (SMT constraints, path statistics, concrete test cases).
  • Comprehensive evaluation metrics – compilation success (syntactic correctness), behavioral equivalence under symbolic execution (semantic fidelity), and code‑quality scores (readability/structure).
  • Empirical finding – GPT‑4.1‑mini consistently outperforms other state‑of‑the‑art LLMs, and the KLEE‑augmented training improves semantic preservation across the board.
  • Practical insight – demonstrates that coupling LLMs with symbolic execution can boost automated testing, static analysis, and program comprehension even when code is deliberately or unintentionally obfuscated.
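To make the first bullet concrete, here is a hand‑written illustration of one of the four transforms, an opaque predicate. The functions are toy examples chosen for this summary, not programs from the paper's benchmark:

```c
#include <stdio.h>

/* Original: straightforward absolute value (illustrative only). */
int abs_val(int x) {
    return x < 0 ? -x : x;
}

/* Opaque-predicate variant: y*y + y == y*(y+1) is always even, so the
 * condition is always true and the else branch is dead code, but a
 * purely syntactic reader cannot tell without reasoning about the
 * arithmetic. */
int abs_val_obf(int x, unsigned y) {
    if ((y * y + y) % 2u == 0u) {
        return x < 0 ? 0 - x : x;
    }
    return x ^ (int)y;  /* unreachable decoy */
}

int main(void) {
    printf("%d %d\n", abs_val(-7), abs_val_obf(-7, 3u));
    return 0;
}
```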

Methodology

  1. Obfuscation generation – The authors programmatically applied four well‑studied transformations to each source file, creating multiple obfuscated variants per original program.
  2. Symbolic execution data collection – For every original/obfuscated pair, KLEE was run to produce (a minimal symbolic‑input harness is sketched after this list):
    • SMT (Satisfiability Modulo Theories) constraints describing each execution path,
    • Path‑level statistics (e.g., number of branches, loop bounds), and
    • Concrete test cases that achieve high path coverage.
  3. Model fine‑tuning – Three leading LLMs (including GPT‑4.1‑mini) were fine‑tuned under two regimes:
    • Baseline: simple “obfuscated → original” code translation pairs.
    • Enhanced: same pairs plus the KLEE artifacts concatenated as auxiliary context.
  4. Evaluation pipeline – After generation, each deobfuscated output was:
    • Compiled to verify syntactic correctness,
    • Run through KLEE again to check whether its symbolic behavior matches the original program (see the equivalence harness sketched below), and
    • Scored with readability metrics (e.g., cyclomatic complexity, naming conventions).
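For step 2, KLEE needs the program's inputs marked symbolic so it can enumerate feasible paths and emit per‑path constraints and concrete test cases. A minimal harness, with a placeholder function standing in for a benchmark program, might look like this:

```c
#include <klee/klee.h>

/* Placeholder for a benchmark program under analysis (illustrative only). */
int classify(int x, int y) {
    if (x > 0 && y > 0) return 1;
    if (x < 0 && y < 0) return -1;
    return 0;
}

int main(void) {
    int x, y;

    /* Mark the inputs symbolic: KLEE explores every feasible path through
     * classify(), recording the constraints that characterize each path
     * and a concrete test case that exercises it. */
    klee_make_symbolic(&x, sizeof(x), "x");
    klee_make_symbolic(&y, sizeof(y), "y");

    return classify(x, y);
}
```

Compiling the harness to LLVM bitcode (e.g., with clang -emit-llvm -c -g) and running klee on the result yields one test case per explored path along with KLEE's run statistics; these are the kinds of artifacts the authors concatenate into the enhanced training examples.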

The whole workflow is fully automated, enabling reproducible, large‑scale testing.
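For the behavioral‑equivalence check in step 4, a common KLEE idiom is to drive the original and the LLM's candidate with the same symbolic input and assert that they agree, so any divergence shows up as a concrete counterexample. The sketch below assumes both versions are linked into one harness; the paper's exact setup may differ:

```c
#include <assert.h>
#include <klee/klee.h>

/* Reference semantics from the original program (illustrative only). */
int f_original(int x) {
    return x < 0 ? -x : x;
}

/* Candidate recovered by the LLM from the obfuscated version. */
int f_candidate(int x) {
    return x >= 0 ? x : 0 - x;
}

int main(void) {
    int x;
    klee_make_symbolic(&x, sizeof(x), "x");

    /* KLEE searches for an input where the two functions disagree; if no
     * explored path violates the assertion, the candidate is behaviorally
     * equivalent to the original on all covered paths. */
    assert(f_original(x) == f_candidate(x));
    return 0;
}
```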

Results & Findings

| Model | Baseline compilation (%) | Baseline semantic equivalence (%) | Enhanced (+KLEE) compilation (%) | Enhanced (+KLEE) semantic equivalence (%) |
| --- | --- | --- | --- | --- |
| GPT‑4.1‑mini | 71 | 58 | 89 | 82 |
| LLaMA‑2‑13B | 63 | 49 | 78 | 71 |
| CodeBERT‑large | 55 | 42 | 70 | 64 |
  • Compilation success jumps by 15‑18 percentage points when KLEE artifacts are added, indicating that the extra semantic hints steer the models away from syntactically illegal rewrites.
  • Semantic fidelity (behavioral equivalence under symbolic execution) improves by roughly 20 percentage points across all models, confirming that the deobfuscated code actually behaves the same as the original.
  • Readability scores rise modestly; the generated code tends to have clearer control flow and more conventional naming after KLEE‑augmented training.
  • GPT‑4.1‑mini consistently produces the most reliable deobfuscations, likely due to its larger context window and stronger reasoning capabilities.

Practical Implications

  • Automated reverse engineering – Security teams can feed obfuscated binaries into a pipeline that first extracts a C representation (e.g., via decompilation) and then uses a KLEE‑enhanced LLM to recover a human‑readable version, speeding up vulnerability triage.
  • Robust static analysis – Tools that rely on source‑level information (linters, type checkers, formal verifiers) can pre‑process third‑party libraries with this approach, mitigating the “black‑box” effect of compiler‑generated optimizations or intentional obfuscation.
  • Continuous integration & testing – When build pipelines apply aggressive optimizations that unintentionally obscure code, a deobfuscation step can regenerate a clean version for downstream testing frameworks, preserving test coverage metrics.
  • Developer productivity – In large codebases where legacy modules have been minified or heavily macro‑expanded, developers can request a “semantic clean‑up” from an LLM, receiving code that compiles and behaves identically but is easier to read and modify.

The key takeaway for practitioners is that symbolic execution data acts as a powerful “semantic anchor” for LLMs, turning them from pure pattern‑matchers into tools that respect program behavior.

Limitations & Future Work

  • Language scope – The study focuses exclusively on C programs; extending to C++, Rust, or managed languages may surface new challenges (e.g., templates, garbage collection).
  • Scalability of KLEE – Symbolic execution can be expensive for large codebases; the authors note that path explosion limited the size of programs they could process.
  • Obfuscation diversity – Only four classic transformations were evaluated; modern packers, virtualization‑based obfuscators, or AI‑generated obfuscations remain untested.
  • Model size vs. cost – While GPT‑4.1‑mini performed best, running large LLMs in CI pipelines may be cost‑prohibitive; future work could explore distilled or quantized models that retain the semantic benefits.

The authors suggest exploring incremental symbolic feedback as a direction for future work.
