[Paper] Consistent or Sensitive? Automated Code Revision Tools Against Semantics-Preserving Perturbations

Published: February 16, 2026 at 04:58 AM EST
4 min read
Source: arXiv - 2602.14595v1

Overview

Automated Code Revision (ACR) tools promise to turn code review comments into ready‑to‑merge patches without a developer’s manual typing. This paper asks a deceptively simple question: Do these tools behave consistently when the same logical bug is presented in slightly different syntactic forms? By probing five state‑of‑the‑art transformer‑based ACR models with thousands of semantics‑preserving code variants, the authors reveal a worrying drop in correctness—up to 45 percentage points—highlighting a hidden reliability gap that could hinder real‑world adoption.

Key Contributions

  • Definition of “consistency” for ACR tools: the ability to produce the same (or equally correct) revision for semantically equivalent code snippets.
  • Nine semantics‑preserving perturbations (SPPs) (e.g., identifier renaming, statement reordering, dead‑code insertion) that systematically alter code without changing its meaning; a toy example follows this list.
  • Large‑scale empirical dataset: 2,032 Java methods from diverse GitHub projects, expanded to more than 10,000 perturbed variants for robust evaluation.
  • Comprehensive consistency benchmark across five transformer‑based ACR models (e.g., CodeT5, PLBART), quantifying performance degradation per perturbation type.
  • Exploratory mitigation attempts using input‑representation tweaks (attention‑guiding heuristics) and analysis of why they fall short.
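
To make the SPPs concrete, the snippet below is a toy Java example of my own (not drawn from the paper's dataset) combining two of the nine perturbations, identifier renaming and dead-code insertion:

```java
// Toy illustration: an original method and a semantics-preserving
// variant. Both return the same value for every input.
public class SppExample {
    // Original method.
    static int maxOf(int a, int b) {
        if (a >= b) {
            return a;
        }
        return b;
    }

    // Perturbed variant: parameters renamed (identifier renaming) and a
    // never-read local inserted (dead-code insertion).
    static int maxOfPerturbed(int left, int right) {
        int unused = 0; // dead code: written, never read
        if (left >= right) {
            return left;
        }
        return right;
    }
}
```

A consistent ACR tool should produce equivalent revisions for both versions when given the same reviewer comment.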

Methodology

  1. Collect real‑world Java methods that have associated reviewer comments and ground‑truth revisions.
  2. Apply nine SPPs to each method, generating multiple syntactically different but semantically identical versions (e.g., swapping if/else branches, adding no‑op statements).
  3. Run each perturbed method through five pre‑trained transformer ACR tools, feeding the original reviewer comment as the prompt.
  4. Measure correctness by comparing the tool’s output against the known human revision using exact‑match and functional equivalence metrics.
  5. Analyze consistency by tracking how often the same tool produces a correct revision across the original and each perturbed variant (a minimal sketch of this bookkeeping follows the list).
  6. Test mitigation ideas: prepend token‑level hints, reorder input lines, or mask certain tokens to steer the model’s attention, then re‑evaluate (a toy sketch of the hint‑prepending idea appears after the next paragraph).
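
Steps 4–5 reduce to straightforward bookkeeping. Here is a minimal sketch of the exact-match consistency check, assuming a hypothetical AcrTool interface whose revise() method stands in for any of the five models; the functional-equivalence check the paper also uses would additionally require compiling and running tests:

```java
import java.util.List;

public final class ConsistencyCheck {

    /** Hypothetical stand-in for any of the five ACR models. */
    interface AcrTool {
        String revise(String reviewComment, String methodSource);
    }

    /**
     * Fraction of variants for which the tool's revision exactly
     * matches the ground-truth human revision.
     */
    static double consistency(String comment, String groundTruth,
                              List<String> variants, AcrTool tool) {
        int correct = 0;
        for (String variant : variants) {
            String revision = tool.revise(comment, variant);
            // Exact match after trimming; functional equivalence would
            // compile the revision and run the project's tests instead.
            if (revision.strip().equals(groundTruth.strip())) {
                correct++;
            }
        }
        return variants.isEmpty() ? 0.0 : (double) correct / variants.size();
    }
}
```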

The pipeline is fully automated, enabling reproducible large‑scale testing without manual labeling of each perturbed case.
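
The hint-prepending idea from step 6 can be as lightweight as prefixing the model input with marker tokens for identifiers mentioned in the reviewer comment. The sketch below is a plausible reading of that tweak, not the paper's exact heuristic; the <HINT>, <COMMENT>, and <CODE> markers are invented for illustration:

```java
import java.util.List;

public final class HintedPrompt {

    /** Builds a model input of the form: hint tokens + comment + code. */
    static String build(String reviewComment, String methodSource,
                        List<String> mentionedIdentifiers) {
        StringBuilder prompt = new StringBuilder();
        // One <HINT> marker per identifier the comment mentions, nudging
        // the encoder's attention toward those tokens in the code.
        for (String id : mentionedIdentifiers) {
            prompt.append("<HINT> ").append(id).append(' ');
        }
        return prompt.append("<COMMENT> ").append(reviewComment)
                     .append(" <CODE> ").append(methodSource)
                     .toString();
    }
}
```

As the results below show, tweaks at this level buy only marginal robustness.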

Results & Findings

  • Consistency loss is substantial: The best‑performing model’s accuracy fell from ~78 % on original code to ~33 % on certain perturbed variants, an absolute drop of 45.3 percentage points.
  • Perturbation proximity matters: Changes that touch the exact region referenced in the review comment (e.g., renaming a variable mentioned in the comment) cause the steepest performance decline.
  • Model‑specific patterns: Some transformers are more robust to structural changes (e.g., statement reordering) but brittle to lexical tweaks (identifier renaming).
  • Mitigation attempts yield marginal gains: Adding heuristic tokens or masking noisy identifiers improved consistency by only 2–4 % on average, indicating that the problem is deeper than input formatting.
  • Error analysis shows models often “over‑fit” to surface forms, treating the perturbed code as a new problem rather than recognizing the underlying semantic equivalence.

Practical Implications

  • Tool reliability in CI pipelines: Developers integrating ACR tools into continuous integration should expect inconsistent behavior when code evolves slightly (e.g., after a refactor) and may need to fall back to manual review.
  • Need for robustness‑aware training: Training data should include diverse syntactic variants of the same bug to teach models invariance to superficial changes.
  • Potential for hybrid workflows: Pairing ACR with a lightweight static analysis pass that normalizes code (e.g., canonical identifier names) could reduce inconsistency, though the paper shows simple heuristics aren’t enough; a toy normalizer is sketched after this list.
  • Impact on code‑review bots: Companies deploying bots that auto‑suggest patches must account for incorrect suggestions triggered by minor, semantics‑preserving edits, possibly by gating auto‑application behind a “confidence” threshold.
  • Guidance for API designers: When exposing ACR functionality (e.g., via IDE plugins), expose the model’s uncertainty and allow developers to review multiple candidate patches.
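
On the hybrid-workflow point, a normalization pass could map every identifier to a canonical name before the code reaches the model. The toy renamer below assumes a pre-tokenized method and is only a sketch; a real implementation would use a parser such as JavaParser to avoid clobbering type names and API calls:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public final class CanonicalRenamer {
    // Tiny keyword list for the sketch; real Java has many more.
    private static final Set<String> KEYWORDS = Set.of(
        "if", "else", "for", "while", "return", "int", "void",
        "public", "private", "static", "new", "class");

    /** Maps each distinct non-keyword identifier to var0, var1, ... */
    static List<String> canonicalize(List<String> tokens) {
        Map<String, String> renames = new HashMap<>();
        List<String> out = new ArrayList<>(tokens.size());
        for (String tok : tokens) {
            boolean identifier = tok.matches("[A-Za-z_][A-Za-z0-9_]*")
                    && !KEYWORDS.contains(tok);
            if (identifier) {
                renames.putIfAbsent(tok, "var" + renames.size());
                out.add(renames.get(tok));
            } else {
                out.add(tok);
            }
        }
        return out;
    }
}
```

One caveat with this design: canonical renaming also erases the very identifiers the reviewer comment refers to, which may be part of why the paper finds simple input-level fixes insufficient.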

Limitations & Future Work

  • Language scope: The study focuses exclusively on Java; other languages with different idioms (e.g., Python’s dynamic typing) may exhibit different consistency patterns.
  • Perturbation set: While nine SPPs cover many common refactorings, they are not exhaustive; real‑world code can undergo more complex transformations (e.g., API migrations).
  • Evaluation metrics: Reliance on exact‑match and functional equivalence may miss nuanced quality differences (readability, style).
  • Mitigation exploration: The paper only scratches the surface of robustness techniques (e.g., data augmentation, adversarial training, contrastive learning). Future work could integrate these methods or develop model architectures that explicitly reason about program semantics (e.g., graph‑based encoders).

Bottom line: Even the most advanced transformer‑based ACR tools can stumble dramatically when faced with harmless syntactic wiggles. For developers hoping to automate away code‑review toil, the takeaway is clear—robustness to semantics‑preserving changes is still an open research frontier, and practical deployments should be built with safeguards and human oversight.

Authors

  • Shirin Pirouzkhah
  • Souhaila Serbout
  • Alberto Bacchelli

Paper Information

  • arXiv ID: 2602.14595v1
  • Categories: cs.SE
  • Published: February 16, 2026
  • PDF: https://arxiv.org/pdf/2602.14595v1