[Paper] A Differential Fuzzing-Based Evaluation of Functional Equivalence in LLM-Generated Code Refactorings
Source: arXiv - 2602.15761v1
Overview
Large language models (LLMs) are increasingly being used to automatically refactor code, but guaranteeing that the refactored version behaves exactly like the original is still an open problem. This paper introduces a differential fuzzing technique that checks functional equivalence without relying on hand‑crafted test suites, and it evaluates how often popular LLMs actually preserve semantics.
Key Contributions
- Differential fuzzing framework for automatically comparing the behavior of original and LLM‑generated refactorings across thousands of random inputs.
- Large‑scale empirical study covering six state‑of‑the‑art LLMs (CodeLlama, Codestral, StarChat2, Qwen‑2.5, Olmo‑3, GPT‑4o) on three benchmark datasets and two refactoring tasks.
- Quantitative evidence that 19‑35 % of LLM‑produced refactorings introduce semantic changes, undermining the common assumption that LLMs reliably preserve behavior when refactoring.
- Detection gap analysis showing that roughly 21 % of the semantic bugs slip past the original test suites, highlighting a blind spot in current evaluation practices.
Methodology
- Dataset preparation – The authors selected three publicly available code corpora, each containing functions with an associated test suite, and applied two refactoring tasks: renaming and method extraction.
- LLM generation – Each function was fed to the six LLMs, which returned a refactored version of the code.
- Differential fuzzing – A custom fuzzer automatically generates a large set of random inputs (including edge‑case values) for the function’s signature. Both the original and the refactored implementations are executed on each input, and their outputs (including exceptions) are compared.
- Equivalence decision – If any input yields a mismatched result, the refactoring is marked non‑equivalent. The same inputs are also run against the original test suites to see whether the test suite would have caught the discrepancy.
- Statistical analysis – Percentages of non‑equivalent refactorings and detection rates are aggregated per LLM, dataset, and refactoring type.
The approach is deliberately model‑agnostic: it treats the LLM as a black box and focuses on observable behavior rather than internal code structure.
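The fuzzing and equivalence-decision steps above can be sketched as a simple harness. This is a minimal sketch, not the paper's implementation: it assumes pure functions over integer arguments, whereas the authors' fuzzer also covers other input types. Note that exceptions are treated as observable outputs, so a refactoring that merely changes which exception is raised is still flagged.

```python
import random

def differential_fuzz(original, refactored, arity, trials=1000, seed=0):
    """Compare two implementations on random inputs.

    Returns a (args, original_output, refactored_output) witness for the
    first mismatch found, or None if no mismatch was observed. Illustrative
    sketch only: integer inputs with a few edge-case values mixed in.
    """
    rng = random.Random(seed)
    edge_cases = [0, 1, -1, 2**31 - 1, -2**31]

    def run(fn, args):
        # Capture exceptions as part of observable behavior.
        try:
            return ("ok", fn(*args))
        except Exception as exc:
            return ("raised", type(exc).__name__)

    for _ in range(trials):
        args = tuple(
            rng.choice(edge_cases) if rng.random() < 0.2
            else rng.randint(-10**6, 10**6)
            for _ in range(arity)
        )
        out_orig = run(original, args)
        out_refac = run(refactored, args)
        if out_orig != out_refac:
            return args, out_orig, out_refac  # witness of non-equivalence
    return None  # no mismatch observed (not a proof of equivalence)
```

A returned `None` is only evidence, not proof, of equivalence, which is why the paper's 19‑35 % figures are lower bounds on semantic drift.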
Results & Findings
| LLM | Non‑equivalent refactorings (% of total) |
|---|---|
| CodeLlama | 19 % |
| Codestral | 22 % |
| StarChat2 | 27 % |
| Qwen‑2.5 | 31 % |
| Olmo‑3 | 35 % |
| GPT‑4o | 19 % |
- Semantic drift is common: Even the strongest model (GPT‑4o) produced nearly one‑fifth of refactorings that changed program behavior.
- Test suites miss bugs: About 21 % of the detected non‑equivalences were not caught by the original test suites, indicating that relying solely on existing tests can give a false sense of safety.
- Refactoring type matters: Extract‑method transformations tended to be more error‑prone than simple renamings, likely because they introduce new control‑flow paths.
Overall, the differential fuzzing approach uncovered many subtle bugs that static analysis or limited test cases would overlook.
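The detection-gap statistic above is straightforward to compute from per-refactoring flags. The following sketch uses illustrative field names, not the paper's data format:

```python
def detection_gap(results):
    """Fraction of fuzz-detected non-equivalences missed by the original tests.

    `results` is a list of dicts with illustrative boolean fields:
      fuzz_mismatch - differential fuzzing found a behavioral difference
      tests_failed  - the original test suite also caught the difference
    """
    detected = [r for r in results if r["fuzz_mismatch"]]
    if not detected:
        return 0.0
    missed = sum(1 for r in detected if not r["tests_failed"])
    return missed / len(detected)
```

A gap of about 0.21, as reported in the paper, means roughly one in five behavioral bugs would ship even in projects with passing test suites.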
Practical Implications
- Tooling: IDE plugins or CI pipelines that integrate differential fuzzing can automatically flag risky LLM‑generated refactorings before they land in production.
- Model selection: Developers can use the reported non‑equivalence rates as a sanity check when choosing an LLM for automated refactoring tasks.
- Test suite augmentation: The uncovered gaps suggest that teams should enrich their test suites with property‑based or fuzz‑style tests, especially for critical libraries.
- Safety nets for AI‑assisted development: Companies deploying LLM‑driven code transformation services should adopt runtime equivalence checks to avoid silently introducing regressions.
In short, the paper provides a practical, scalable safety net that can be dropped into existing development workflows.
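As a concrete example of the recommended test-suite augmentation, a fuzz-style equivalence test can assert that a refactored function matches the original on randomly sampled inputs. The functions below are illustrative, not from the paper; property-based testing libraries such as Hypothesis offer richer, shrinking-capable variants of the same idea:

```python
import random

def clamp_original(x, lo, hi):
    return max(lo, min(x, hi))

def clamp_refactored(x, lo, hi):
    # Hypothetical LLM refactoring: early clamping instead of nested min/max.
    y = hi if x > hi else x
    return y if y > lo else lo

def test_refactoring_preserves_behavior(trials=10_000, seed=42):
    # Fuzz-style equivalence property: same output on every sampled input,
    # including degenerate ranges where lo > hi.
    rng = random.Random(seed)
    for _ in range(trials):
        x, lo, hi = (rng.randint(-100, 100) for _ in range(3))
        assert clamp_refactored(x, lo, hi) == clamp_original(x, lo, hi), (x, lo, hi)
```

Checking the original implementation alongside the refactored one, rather than hand-picked expected values, is what lets such a test catch the semantic drift the paper measures.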
Limitations & Future Work
- Input generation scope: The fuzzing strategy focuses on primitive input types and may miss bugs that involve complex objects, file I/O, or external services.
- Performance overhead: Running thousands of executions per function can be costly for large codebases; smarter input selection or sampling strategies are needed.
- Dataset bias: The three benchmark datasets are relatively small and may not represent the full diversity of real‑world codebases.
- Future directions suggested by the authors include extending the framework to handle stateful APIs, integrating symbolic execution for deeper coverage, and exploring automated repair mechanisms for detected non‑equivalences.
Authors
- Simantika Bhattacharjee Dristi
- Matthew B. Dwyer
Paper Information
- arXiv ID: 2602.15761v1
- Categories: cs.SE
- Published: February 17, 2026