[Paper] Model See, Model Do? Exposure-Aware Evaluation of Bug-vs-Fix Preference in Code LLMs

Published: January 15, 2026 at 10:14 AM EST
5 min read
Source: arXiv - 2601.10496v1

Overview

Large language models (LLMs) are now a staple for code generation and automated debugging, yet they can still churn out buggy snippets that echo mistakes from their training data. This paper introduces an exposure‑aware evaluation framework that measures how a model’s prior “seeing” of buggy versus fixed code sways its preference when asked to complete or rank code. By probing this bias, the authors reveal a hidden risk: LLMs may unintentionally propagate memorised errors.

Key Contributions

  • Exposure‑aware benchmark: Combines the ManySStuBs4J bug‑fix dataset with “Data Portraits” membership tests on the Stack‑V2 corpus to infer whether each buggy or fixed variant was likely present in the model’s training set.
  • Stratified analysis: Groups examples by exposure status (neither seen, bug‑only seen, fix‑only seen, both seen) and evaluates model behavior across these strata (a small data‑structure sketch follows this list).
  • Dual evaluation lenses:
    • Generation: Measures how often the model reproduces the buggy line versus the fixed line in a code‑completion setting.
    • Likelihood scoring: Applies several token‑probability metrics (min/max token prob, Gini coefficient, etc.) to see which variant the model deems more probable.
  • Empirical findings: Shows that models are far more likely to regenerate buggy code they were exposed to during training, while likelihood‑based scores generally favor the correct fix, with the exception of the Gini metric under bug‑only exposure.
  • Risk articulation: Highlights that standard bug‑fix evaluation can be skewed by data exposure, suggesting that LLMs could unintentionally spread historic coding mistakes.
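
To make the exposure buckets concrete, here is a minimal Python sketch of the stratification step (not the authors' code): `seen_in_corpus` is a hypothetical stand‑in for the Data Portraits membership test, and whether the test runs on the bare line or the line plus its context is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class ExposureStratum(Enum):
    """The four exposure buckets used in the stratified analysis."""
    NONE = "neither variant seen"
    BUG_ONLY = "only the buggy variant seen"
    FIX_ONLY = "only the fixed variant seen"
    BOTH = "both variants seen"


@dataclass
class BugFixPair:
    """A ManySStuBs4J-style single-statement bug and its fix."""
    buggy_line: str
    fixed_line: str
    context: str  # surrounding code used as the completion prompt


def seen_in_corpus(snippet: str) -> bool:
    """Hypothetical stand-in for a Data Portraits membership test against
    the Stack-V2 corpus (the real test is an approximate, Bloom-filter-style
    lookup, not an exact string match)."""
    raise NotImplementedError


def exposure_stratum(pair: BugFixPair) -> ExposureStratum:
    """Assign a bug-fix pair to one of the four exposure buckets."""
    bug_seen = seen_in_corpus(pair.buggy_line)
    fix_seen = seen_in_corpus(pair.fixed_line)
    if bug_seen and fix_seen:
        return ExposureStratum.BOTH
    if bug_seen:
        return ExposureStratum.BUG_ONLY
    if fix_seen:
        return ExposureStratum.FIX_ONLY
    return ExposureStratum.NONE
```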

Methodology

  1. Dataset preparation – The authors start with ManySStuBs4J, a curated collection of real‑world Java single‑statement bugs and their corresponding fixes.
  2. Exposure estimation – Using Data Portraits (a fast membership‑testing technique), each buggy and fixed snippet is checked against the Stack‑V2 corpus (the public data that likely fed into many code LLMs). This yields a binary “seen / not seen” label for each variant.
  3. Stratification – Examples are split into four buckets:
    • None: Neither buggy nor fixed code seen.
    • Bug‑only: Only the buggy version appears in the corpus.
    • Fix‑only: Only the fixed version appears.
    • Both: Both variants appear.
  4. Model probing – Two families of LLMs (e.g., Codex, StarCoder) are queried in code‑completion mode and prompted to fill in the missing line. Each output is classified as reproducing the bug, reproducing the fix, or neither.
  5. Likelihood scoring – For each prompt, the model’s token‑level probabilities are aggregated using several metrics:
    • Minimum token probability
    • Maximum token probability
    • Average token probability
    • Gini coefficient (measures probability concentration)
      The variant with the higher score is considered the model’s “preference” (see the scoring sketch after this list).
  6. Statistical analysis – The authors compare generation frequencies and scoring preferences across exposure strata, using chi‑square tests and confidence intervals to assess significance.
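
Below is a minimal sketch of the likelihood‑scoring step, assuming the per‑token probabilities of each candidate line are already available (e.g., from a log‑probability API). The metric set mirrors the list above; the "higher score wins" rule follows the description in step 5, and the paper's exact metric definitions may differ.

```python
from statistics import mean


def gini(probs: list[float]) -> float:
    """Gini coefficient of the token probabilities: 0 means all tokens are
    equally probable; values near 1 mean probability mass is concentrated
    on a few tokens."""
    xs = sorted(probs)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n


def score_variant(token_probs: list[float]) -> dict[str, float]:
    """Aggregate per-token probabilities into the metrics used for
    likelihood-based preference."""
    return {
        "min_prob": min(token_probs),
        "max_prob": max(token_probs),
        "avg_prob": mean(token_probs),
        "gini": gini(token_probs),
    }


def preferred_variant(bug_probs: list[float],
                      fix_probs: list[float],
                      metric: str) -> str:
    """Return which variant a given metric prefers: the variant with the
    higher aggregated score counts as the model's preference (ties go
    to the fix in this sketch)."""
    bug_score = score_variant(bug_probs)[metric]
    fix_score = score_variant(fix_probs)[metric]
    return "bug" if bug_score > fix_score else "fix"
```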

Results & Findings

  • Exposure distribution: 67 % of the bug‑fix pairs have no exposure (neither variant seen). When exposure exists, fixes are more often present than bugs.
  • Generation bias:
    • Overall, models reproduce the buggy line ~2× more often than the fixed line.
    • In the bug‑only bucket, this bias spikes: buggy code is regenerated in ≈80 % of completions.
    • In the fix‑only bucket, the improvement is modest; buggy lines still appear in ≈55 % of completions.
  • Likelihood scoring:
    • Min/Max token‑probability metrics consistently assign higher scores to the fixed variant across all exposure conditions.
    • The Gini coefficient flips its preference when only the buggy variant was seen, favoring the bug instead of the fix.
  • Interpretation: Token‑probability‑based scoring appears to capture an intrinsic bias toward correct code, whereas generation is heavily swayed by memorised patterns from the training data (the sketch below shows how such a stratified comparison can be tabulated and tested).
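
To show how stratified generation counts and the chi‑square test from the methodology fit together, here is a small sketch with made‑up counts: only the bug‑only and fix‑only proportions loosely echo the percentages above, and none of the numbers come from the paper. It requires SciPy.

```python
from scipy.stats import chi2_contingency  # chi-square test of independence

# Illustrative counts only -- NOT the paper's numbers.
# Rows: exposure stratum; columns: completions reproducing the bug vs. the fix.
observed = {
    "bug-only": [80, 20],
    "fix-only": [55, 45],
    "none":     [40, 60],
}

# Per-stratum share of completions that reproduce the bug.
for stratum, (bug_count, fix_count) in observed.items():
    share = bug_count / (bug_count + fix_count)
    print(f"{stratum:>8}: {share:.0%} of completions reproduce the bug")

# Test whether the outcome distribution is independent of exposure stratum.
chi2, p_value, dof, _expected = chi2_contingency(list(observed.values()))
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```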

Practical Implications

  • Debug‑assistant reliability: Tools that rely on LLMs for automated bug fixing must account for the possibility that the model has memorised the buggy pattern, especially if the training data includes that exact mistake.
  • Evaluation pipelines: Benchmarks that simply compare generated code against a ground‑truth fix may over‑estimate model competence if they ignore exposure. Incorporating exposure‑aware stratification can give a more realistic picture.
  • Data curation: Curators of code corpora for LLM training should consider filtering or annotating known bugs to reduce the risk of propagating them.
  • Prompt engineering: Adding explicit “fix this bug” cues or providing surrounding context that highlights the error can mitigate the model’s tendency to repeat memorised bugs.
  • Safety nets: Integrating post‑generation static analysis or test‑generation steps can catch memorised bugs before they reach production, as sketched below.
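
One way to operationalise the safety‑net idea is a simple acceptance gate over generated lines. The sketch below is an illustration only: the known‑bug lookup and the static checks are placeholders, not any real tool's API.

```python
from typing import Callable


def accept_completion(
    generated_line: str,
    known_buggy_lines: set[str],
    static_checks: list[Callable[[str], bool]],
) -> bool:
    """Reject a generated line if it matches a known buggy pattern or fails
    any static check; otherwise accept it for further review."""
    if generated_line.strip() in known_buggy_lines:
        return False  # exact match with a known (possibly memorised) bug
    return all(check(generated_line) for check in static_checks)


# Placeholder usage: one known single-statement bug and a toy static rule.
known_bugs = {"if (x = 1) {"}  # assignment where a comparison was intended
checks = [lambda line: line.count("(") == line.count(")")]  # balanced parens
print(accept_completion("if (x == 1) {", known_bugs, checks))  # True
```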

Limitations & Future Work

  • Exposure inference is approximate: Membership testing via Data Portraits cannot guarantee that a snippet was truly present in the model’s training set, especially for private or proprietary data sources.
  • Language & dataset scope: The study focuses on Java single‑statement bugs; results may differ for multi‑line fixes, other languages, or more complex refactorings.
  • Model diversity: Only a handful of publicly known code LLMs were evaluated; newer or domain‑specific models could behave differently.
  • Future directions:
    • Extending the framework to multi‑language, multi‑statement bug corpora.
    • Investigating mitigation strategies (e.g., exposure‑aware fine‑tuning, contrastive training).
    • Exploring how exposure interacts with other biases such as style conformity or API usage patterns.

Bottom line: While LLMs show a promising inclination toward correct code when judged by token probabilities, their generation behavior can still be hijacked by memorised bugs. Recognising and accounting for exposure is essential for building trustworthy AI‑driven development tools.

Authors

  • Ali Al‑Kaswan
  • Claudio Spiess
  • Prem Devanbu
  • Arie van Deursen
  • Maliheh Izadi

Paper Information

  • arXiv ID: 2601.10496v1
  • Categories: cs.SE, cs.AI
  • Published: January 15, 2026