[Paper] Localized Calibrated Uncertainty in Code Language Models
Source: arXiv - 2512.24560v1
Overview
Large language models (LLMs) can now turn natural‑language prompts into functional source code, but the generated snippets often need tweaking before they truly match a developer’s intent. This paper introduces a way to pinpoint exactly which lines of generated code are likely to need edits, giving developers a calibrated confidence score for each region of the output.
Key Contributions
- Minimal Intent‑Aligning Patch dataset – a curated collection of LLM‑generated programs together with the smallest line‑level edits (patches) that make them pass a suite of test cases.
- Localized calibration framework – methods to assign well‑calibrated probabilities to individual code spans, indicating the odds they will be edited in the minimal patch.
- White‑box probing technique – an efficient “arbitrary‑span query” probe that leverages a small supervisor model to estimate edit probabilities across any contiguous block of code.
- Benchmark against black‑box baselines – reflective prompting and self‑consistency approaches are evaluated, showing the probe’s superior calibration (Brier Skill Score ≈ 0.2).
- Cross‑domain generalization hint – a probe trained solely on code errors exhibits some ability to flag natural‑language errors when a simple probability scaling is applied.
Methodology
Dataset Construction
- Generate code snippets using several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude).
- Run each snippet against a set of unit tests; when failures occur, compute the minimal line‑level edit that makes the program pass.
- Store the original snippet, the test suite, and the minimal patch; together these records form the Minimal Intent‑Aligning Patch dataset (a record‑construction sketch follows below).
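Below is a minimal sketch of what one dataset record might look like, assuming the pipeline pairs each generated program with its repaired, test‑passing version and derives edited line indices from a line‑level diff. The names PatchRecord and make_patch_record are illustrative, not taken from the paper.

```python
# Minimal sketch (not the authors' exact pipeline): pair a generated program with its
# repaired, test-passing version and record which line indices were edited.
# PatchRecord and make_patch_record are illustrative names.
import difflib
from dataclasses import dataclass, field


@dataclass
class PatchRecord:
    generated_code: str   # original LLM output
    repaired_code: str    # output after the minimal intent-aligning patch
    test_suite: str       # unit tests that define "intent"
    edited_lines: list = field(default_factory=list)  # 0-based indices of edited lines


def make_patch_record(generated_code: str, repaired_code: str, test_suite: str) -> PatchRecord:
    gen_lines = generated_code.splitlines()
    fix_lines = repaired_code.splitlines()
    edited = []
    matcher = difflib.SequenceMatcher(a=gen_lines, b=fix_lines)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            edited.extend(range(i1, i2))          # lines changed or removed by the patch
        elif tag == "insert" and i1 < len(gen_lines):
            edited.append(i1)                     # mark the line at the insertion point
    return PatchRecord(generated_code, repaired_code, test_suite, sorted(set(edited)))
```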
Calibration Objective
- For any line (or span) i in a generated program, the goal is a probability p_i such that, among spans assigned probability p_i, the fraction actually edited in the minimal patch is approximately p_i.
- Calibration is measured with Expected Calibration Error (ECE) and Brier Skill Score (BSS); a computation sketch for both follows below.
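Both metrics have standard definitions; the sketch below computes them for binary "will this span be edited?" labels. The 10‑bin ECE and the base‑rate reference forecast in the BSS are conventional choices, not necessarily those used in the paper.

```python
# Minimal sketch of the two calibration metrics for binary "will this span be edited?"
# labels. The 10-bin ECE and the base-rate reference forecast in the BSS are
# conventional choices, not necessarily those used in the paper.
import numpy as np


def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap            # weight the gap by the bin's sample share
    return float(ece)


def brier_skill_score(probs: np.ndarray, labels: np.ndarray) -> float:
    brier = np.mean((probs - labels) ** 2)
    base_rate = labels.mean()                     # reference: always predict the edit base rate
    brier_ref = np.mean((base_rate - labels) ** 2)
    return float(1.0 - brier / brier_ref)
```

Under this definition, a BSS of ≈ 0.2 means the probe's Brier score is about 20% lower than that of always predicting the dataset's overall edit rate.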
Probing Approach (White‑Box)
- Train a lightweight “supervisor” model (e.g., a 1‑billion‑parameter transformer) to predict edit probabilities from the internal activations of the large code LLM.
- Implement an arbitrary‑span query: the probe efficiently aggregates token‑level signals into a probability for any contiguous block without re‑running the whole model (see the sketch below).
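The summary does not pin down the probe architecture or the span‑aggregation rule, so the sketch below is one plausible design: a small feed‑forward head scores each token from the target LLM's frozen hidden states, and a noisy‑OR over the queried span yields the span‑level edit probability. SpanEditProbe and its layer sizes are assumptions, not the paper's architecture.

```python
# Sketch of an activation probe that answers arbitrary-span queries, assuming access to
# per-token hidden states from the target code LLM. The architecture and the noisy-OR
# span aggregation are plausible choices, not necessarily the paper's.
import torch
import torch.nn as nn


class SpanEditProbe(nn.Module):
    def __init__(self, hidden_dim: int, probe_dim: int = 256):
        super().__init__()
        # Small supervisor head trained on top of frozen LLM activations.
        self.token_scorer = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.GELU(),
            nn.Linear(probe_dim, 1),
        )

    def token_probs(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_dim) activations for one generated program.
        return torch.sigmoid(self.token_scorer(hidden_states)).squeeze(-1)

    def span_prob(self, hidden_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
        # Probability that some token in [start, end) is edited: noisy-OR over token scores.
        p = self.token_probs(hidden_states)[start:end]
        return 1.0 - torch.prod(1.0 - p)
```

Noisy‑OR treats each token as an independent trigger for an edit, so the span probability grows monotonically with span length; mean‑ or max‑pooling over token scores would be equally reasonable aggregations.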
Black‑Box Baselines
- Reflective prompting: ask the LLM to self‑diagnose its output (“Which lines might be wrong?”).
- Self‑consistency: generate multiple samples and compute the per‑line variance (disagreement) across them as a proxy for uncertainty (see the sketch below).
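A minimal sketch of such a disagreement proxy, assuming lines are compared after whitespace normalization; the paper's exact measure may differ.

```python
# Sketch of a self-consistency proxy: score each line of the primary sample by how often
# the same (whitespace-normalized) line appears in alternative samples. The paper's exact
# disagreement measure may differ.
def line_uncertainty(primary: str, alternatives: list[str]) -> list[float]:
    alt_line_sets = [
        {line.strip() for line in alt.splitlines() if line.strip()}
        for alt in alternatives
    ]
    scores = []
    for line in primary.splitlines():
        key = line.strip()
        if not key or not alt_line_sets:
            scores.append(0.0)                    # blank lines (or no alternatives) carry no signal
            continue
        agreement = sum(key in s for s in alt_line_sets) / len(alt_line_sets)
        scores.append(1.0 - agreement)            # high disagreement => higher suspected-edit score
    return scores
```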
Evaluation
- Compare calibration error and BSS across probes and baselines on a held‑out test set of patches.
- Test cross‑domain transfer by applying the code‑trained probe to natural‑language error‑detection tasks with a simple probability‑scaling adjustment (see the sketch below).
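A minimal sketch of one such scaling, assuming an affine rescaling p' = a*p + b fit by least squares on a small labeled calibration set from the new domain; the paper's exact adjustment may differ.

```python
# Sketch of a "simple scaling" adjustment for cross-domain transfer: fit p' = a*p + b by
# least squares (which minimizes the Brier score over affine maps) on a small labeled
# calibration set from the new domain, then clip to [0, 1]. The paper's exact rescaling
# may differ.
import numpy as np


def fit_affine_rescale(probs: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    X = np.column_stack([probs, np.ones_like(probs)])
    (a, b), *_ = np.linalg.lstsq(X, labels.astype(float), rcond=None)
    return float(a), float(b)


def apply_rescale(probs: np.ndarray, a: float, b: float) -> np.ndarray:
    return np.clip(a * probs + b, 0.0, 1.0)
```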
Results & Findings
| Method | ECE ↓ | Brier Skill Score ↑ |
|---|---|---|
| White‑box probe (small supervisor) | 0.07 | 0.20 |
| Reflective prompting | 0.18 | 0.05 |
| Self‑consistency variance | 0.15 | 0.07 |
- The probe’s calibration error is less than half that of either black‑box baseline.
- Even though the supervisor model is orders of magnitude smaller than the target LLM, it can reliably predict which lines will be edited.
- When the probe’s output is linearly rescaled, it modestly flags grammatical errors in plain English, suggesting the learned signal captures a more general notion of “uncertainty”.
Practical Implications
- IDE Integration: Plug the probe into code editors to highlight “high‑risk” lines in real time, letting developers focus their review where it matters most.
- Automated Refactoring Tools: Use calibrated edit probabilities to prioritize automated fixes or suggest targeted test generation.
- Continuous Integration (CI): CI pipelines could automatically flag generated code that exceeds a calibrated risk threshold, prompting human review before merge (a gating sketch follows this list).
- Model‑agnostic Oversight: Because the probe works from internal activations, it can be attached to a future, larger code LLM by training only the lightweight supervisor; the target model itself is never fine‑tuned, offering a low‑cost safety layer.
- Cost Savings: By narrowing the debugging surface, teams can reduce the time spent on post‑generation edits, especially in large‑scale code‑generation workflows (e.g., scaffolding microservices, data‑pipeline scripts).
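As an illustration of the CI gate mentioned above, a check over calibrated per‑line edit probabilities could be as simple as the sketch below; both thresholds are arbitrary policy knobs, not values from the paper.

```python
# Illustrative CI gate over calibrated per-line edit probabilities; both thresholds are
# policy knobs chosen for the example, not values from the paper.
def needs_human_review(line_probs: list[float],
                       line_threshold: float = 0.5,
                       expected_edit_budget: float = 1.0) -> bool:
    any_risky_line = any(p >= line_threshold for p in line_probs)
    expected_edits = sum(line_probs)              # calibrated probabilities sum to expected edits
    return any_risky_line or expected_edits > expected_edit_budget
```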
Limitations & Future Work
- Dataset Scope: The patches are derived from relatively small, self‑contained programming tasks; scaling to large, multi‑file projects may expose new failure modes.
- Probe Generalization: While early results hint at cross‑domain transfer, the probe still requires task‑specific scaling to handle natural‑language errors effectively.
- Dependency on Test Suites: The notion of “minimal edit” hinges on the quality and completeness of the provided tests; flaky or under‑specified tests could bias calibration.
- Black‑Box Alternatives: More sophisticated ensemble or Bayesian approaches might close the gap further, a direction the authors suggest exploring.
Bottom line: This work offers a practical, calibrated uncertainty signal for code‑generating LLMs, turning the “black‑box” output into a more actionable artifact for developers and tooling ecosystems.
Authors
- David Gros
- Prem Devanbu
Paper Information
- arXiv ID: 2512.24560v1
- Categories: cs.SE, cs.AI
- Published: December 31, 2025