[Paper] Model See, Model Do? Exposure-Aware Evaluation of Bug-vs-Fix Preference in Code LLMs

Published: January 15, 2026 at 10:14 AM EST
5 min read
Source: arXiv - 2601.10496v1

Overview

Large language models (LLMs) are now a staple for code generation and automated debugging, yet they can still churn out buggy snippets that echo mistakes from their training data. This paper introduces an exposure‑aware evaluation framework that measures how a model’s prior “seeing” of buggy versus fixed code sways its preference when asked to complete or rank code. By probing this bias, the authors reveal a hidden risk: LLMs may unintentionally propagate memorised errors.

Key Contributions

  • Exposure‑aware benchmark: Combines the ManySStuBs4J bug‑fix dataset with “Data Portraits” membership tests on the Stack‑V2 corpus to infer whether each buggy or fixed variant was likely present in the model’s training set.
  • Stratified analysis: Groups examples by exposure status (neither seen, bug‑only seen, fix‑only seen, both seen) and evaluates model behavior across these strata (a small data‑structure sketch follows this list).
  • Dual evaluation lenses:
    • Generation: Measures how often the model reproduces the buggy line versus the fixed line in a code‑completion setting.
    • Likelihood scoring: Applies several token‑probability metrics (min/max token prob, Gini coefficient, etc.) to see which variant the model deems more probable.
  • Empirical findings: Shows that models are far more likely to regenerate buggy code they were exposed to during training, while likelihood‑based scores generally favor the correct fix, with the exception of the Gini metric under bug‑only exposure.
  • Risk articulation: Highlights that standard bug‑fix evaluation can be skewed by data exposure, suggesting that LLMs could unintentionally spread historic coding mistakes.
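
To make the exposure buckets concrete, here is a minimal Python sketch of the stratification step (not the authors' code): `seen_in_corpus` is a hypothetical stand‑in for the Data Portraits membership test, and whether the test runs on the bare line or the line plus its context is an assumption of this sketch.

```python
from dataclasses import dataclass
from enum import Enum


class ExposureStratum(Enum):
    """The four exposure buckets used in the stratified analysis."""
    NONE = "neither variant seen"
    BUG_ONLY = "only the buggy variant seen"
    FIX_ONLY = "only the fixed variant seen"
    BOTH = "both variants seen"


@dataclass
class BugFixPair:
    """A ManySStuBs4J-style single-statement bug and its fix."""
    buggy_line: str
    fixed_line: str
    context: str  # surrounding code used as the completion prompt


def seen_in_corpus(snippet: str) -> bool:
    """Hypothetical stand-in for a Data Portraits membership test against
    the Stack-V2 corpus (the real test is an approximate, Bloom-filter-style
    lookup, not an exact string match)."""
    raise NotImplementedError


def exposure_stratum(pair: BugFixPair) -> ExposureStratum:
    """Assign a bug-fix pair to one of the four exposure buckets."""
    bug_seen = seen_in_corpus(pair.buggy_line)
    fix_seen = seen_in_corpus(pair.fixed_line)
    if bug_seen and fix_seen:
        return ExposureStratum.BOTH
    if bug_seen:
        return ExposureStratum.BUG_ONLY
    if fix_seen:
        return ExposureStratum.FIX_ONLY
    return ExposureStratum.NONE
```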

Methodology

  1. Dataset preparation – The authors start with ManySStuBs4J, a curated collection of real‑world Java single‑statement bugs and their corresponding fixes.
  2. Exposure estimation – Using Data Portraits (a fast membership‑testing technique), each buggy and fixed snippet is checked against the Stack‑V2 corpus (the public data that likely fed into many code LLMs). This yields a binary “seen / not seen” label for each variant.
  3. Stratification – Examples are split into four buckets:
    • None: Neither buggy nor fixed code seen.
    • Bug‑only: Only the buggy version appears in the corpus.
    • Fix‑only: Only the fixed version appears.
    • Both: Both variants appear.
  4. Model probing – Two families of LLMs (e.g., Codex, StarCoder) are queried in code‑completion mode and prompted to fill in the missing line. Each output is classified as reproducing the bug, reproducing the fix, or neither.
  5. Likelihood scoring – For each prompt, the model’s token‑level probabilities are aggregated using several metrics:
    • Minimum token probability
    • Maximum token probability
    • Average token probability
    • Gini coefficient (measures probability concentration)
      The variant with the higher score is considered the model’s “preference” (see the scoring sketch after this list).
  6. Statistical analysis – The authors compare generation frequencies and scoring preferences across exposure strata, using chi‑square tests and confidence intervals to assess significance.
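
Below is a minimal sketch of the likelihood‑scoring step, assuming the per‑token probabilities of each candidate line are already available (e.g., from a log‑probability API). The metric set mirrors the list above; the "higher score wins" rule follows the description in step 5, and the paper's exact metric definitions may differ.

```python
from statistics import mean


def gini(probs: list[float]) -> float:
    """Gini coefficient of the token probabilities: 0 means all tokens are
    equally probable; values near 1 mean probability mass is concentrated
    on a few tokens."""
    xs = sorted(probs)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n


def score_variant(token_probs: list[float]) -> dict[str, float]:
    """Aggregate per-token probabilities into the metrics used for
    likelihood-based preference."""
    return {
        "min_prob": min(token_probs),
        "max_prob": max(token_probs),
        "avg_prob": mean(token_probs),
        "gini": gini(token_probs),
    }


def preferred_variant(bug_probs: list[float],
                      fix_probs: list[float],
                      metric: str) -> str:
    """Return which variant a given metric prefers: the variant with the
    higher aggregated score counts as the model's preference (ties go
    to the fix in this sketch)."""
    bug_score = score_variant(bug_probs)[metric]
    fix_score = score_variant(fix_probs)[metric]
    return "bug" if bug_score > fix_score else "fix"
```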

Results & Findings

  • Exposure distribution: 67 % of the bug‑fix pairs have no exposure (neither variant seen). When exposure exists, fixes are more often present than bugs.
  • Generation bias:
    • Overall, models reproduce the buggy line ~2× more often than the fixed line.
    • In the bug‑only bucket, this bias spikes: buggy code is regenerated in ≈80 % of completions.
    • In the fix‑only bucket, the improvement is modest; buggy lines still appear in ≈55 % of completions.
  • Likelihood scoring:
    • Min/Max token‑probability metrics consistently assign higher scores to the fixed variant across all exposure conditions.
    • The Gini coefficient flips its preference when only the buggy variant was seen, favoring the bug instead of the fix.
  • Interpretation: Token‑probability‑based scoring appears to capture an intrinsic bias toward correct code, whereas generation is heavily swayed by memorised patterns from the training data (the sketch below shows how such a stratified comparison can be tabulated and tested).
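
To show how stratified generation counts and the chi‑square test from the methodology fit together, here is a small sketch with made‑up counts: only the bug‑only and fix‑only proportions loosely echo the percentages above, and none of the numbers come from the paper. It requires SciPy.

```python
from scipy.stats import chi2_contingency  # chi-square test of independence

# Illustrative counts only -- NOT the paper's numbers.
# Rows: exposure stratum; columns: completions reproducing the bug vs. the fix.
observed = {
    "bug-only": [80, 20],
    "fix-only": [55, 45],
    "none":     [40, 60],
}

# Per-stratum share of completions that reproduce the bug.
for stratum, (bug_count, fix_count) in observed.items():
    share = bug_count / (bug_count + fix_count)
    print(f"{stratum:>8}: {share:.0%} of completions reproduce the bug")

# Test whether the outcome distribution is independent of exposure stratum.
chi2, p_value, dof, _expected = chi2_contingency(list(observed.values()))
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```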

Practical Implications

  • Debug‑assistant reliability: Tools that rely on LLMs for automated bug fixing must account for the possibility that the model has memorised the buggy pattern, especially if the training data includes that exact mistake.
  • Evaluation pipelines: Benchmarks that simply compare generated code against a ground‑truth fix may over‑estimate model competence if they ignore exposure. Incorporating exposure‑aware stratification can give a more realistic picture.
  • Data curation: Curators of code corpora for LLM training should consider filtering or annotating known bugs to reduce the risk of propagating them.
  • Prompt engineering: Adding explicit “fix this bug” cues or providing surrounding context that highlights the error can mitigate the model’s tendency to repeat memorised bugs.
  • Safety nets: Integrating post‑generation static analysis or test‑generation steps can catch memorised bugs before they reach production, as sketched below.
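
One way to operationalise the safety‑net idea is a simple acceptance gate over generated lines. The sketch below is an illustration only: the known‑bug lookup and the static checks are placeholders, not any real tool's API.

```python
from typing import Callable


def accept_completion(
    generated_line: str,
    known_buggy_lines: set[str],
    static_checks: list[Callable[[str], bool]],
) -> bool:
    """Reject a generated line if it matches a known buggy pattern or fails
    any static check; otherwise accept it for further review."""
    if generated_line.strip() in known_buggy_lines:
        return False  # exact match with a known (possibly memorised) bug
    return all(check(generated_line) for check in static_checks)


# Placeholder usage: one known single-statement bug and a toy static rule.
known_bugs = {"if (x = 1) {"}  # assignment where a comparison was intended
checks = [lambda line: line.count("(") == line.count(")")]  # balanced parens
print(accept_completion("if (x == 1) {", known_bugs, checks))  # True
```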

Limitations & Future Work

  • Exposure inference is approximate: Membership testing via Data Portraits cannot guarantee that a snippet was truly present in the model’s training set, especially for private or proprietary data sources.
  • Language & dataset scope: The study focuses on Java single‑statement bugs; results may differ for multi‑line fixes, other languages, or more complex refactorings.
  • Model diversity: Only a handful of publicly known code LLMs were evaluated; newer or domain‑specific models could behave differently.
  • Future directions:
    • Extending the framework to multi‑language, multi‑statement bug corpora.
    • Investigating mitigation strategies (e.g., exposure‑aware fine‑tuning, contrastive training).
    • Exploring how exposure interacts with other biases such as style conformity or API usage patterns.

Bottom line: While LLMs show a promising inclination toward correct code when judged by token probabilities, their generation behavior can still be hijacked by memorised bugs. Recognising and accounting for exposure is essential for building trustworthy AI‑driven development tools.

Authors

  • Ali Al‑Kaswan
  • Claudio Spiess
  • Prem Devanbu
  • Arie van Deursen
  • Maliheh Izadi

Paper Information

  • arXiv ID: 2601.10496v1
  • Categories: cs.SE, cs.AI
  • Published: January 15, 2026