[Paper] Localized Calibrated Uncertainty in Code Language Models

Published: December 30, 2025 at 09:00 PM EST
4 min read
Source: arXiv - 2512.24560v1

Overview

Large language models (LLMs) can now turn natural‑language prompts into functional source code, but the generated snippets often need tweaking before they truly match a developer’s intent. This paper introduces a way to pinpoint exactly which lines of generated code are likely to need edits, giving developers a calibrated confidence score for each region of the output.

Key Contributions

  • Minimal Intent‑Aligning Patch dataset – a curated collection of LLM‑generated programs together with the smallest line‑level edits (patches) that make them pass a suite of test cases.
  • Localized calibration framework – methods to assign well‑calibrated probabilities to individual code spans, indicating the odds they will be edited in the minimal patch.
  • White‑box probing technique – an efficient “arbitrary‑span query” probe that leverages a small supervisor model to estimate edit probabilities across any contiguous block of code.
  • Benchmark against black‑box baselines – reflective prompting and self‑consistency approaches are evaluated, showing the probe’s superior calibration (Brier Skill Score ≈ 0.2).
  • Cross‑domain generalization hint – a probe trained solely on code errors exhibits some ability to flag natural‑language errors when a simple probability scaling is applied.

Methodology

  1. Dataset Construction

    • Generate code snippets using several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude).
    • Run each snippet against a set of unit tests; when failures occur, compute the minimal line‑level edit that makes the program pass (a diff‑based sketch of this step follows the list).
    • Store the original snippet, the test suite, and the minimal patch – this becomes the “Intent‑Aligning Patch” dataset.
  2. Calibration Objective

    • For any line (or span) i in a generated program, the goal is a probability p_i such that, among spans assigned probability p_i, the empirical fraction that actually gets edited equals p_i.
    • Calibration is measured with Expected Calibration Error (ECE) and Brier Skill Score (BSS); both metrics are sketched in code after this list.
  3. Probing Approach (White‑Box)

    • Train a lightweight “supervisor” model (e.g., a 1‑billion‑parameter transformer) to predict edit probabilities from the internal activations of the large code LLM.
    • Implement an arbitrary‑span query: the probe efficiently aggregates token‑level signals to produce a probability for any contiguous block without re‑running the whole model (a toy span‑probe sketch appears after this list).
  4. Black‑Box Baselines

    • Reflective prompting: ask the LLM to self‑diagnose its output (“Which lines might be wrong?”).
    • Self‑consistency: generate multiple samples and compute the variance across them as a proxy for uncertainty (sketched after this list).
  5. Evaluation

    • Compare calibration error and BSS across probes and baselines on a held‑out test set of patches.
    • Test cross‑domain transfer by applying the code‑trained probe to natural‑language error detection tasks with a simple scaling adjustment (a scaling sketch follows).
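
To make step 1 concrete, here is a minimal sketch of how a line‑level patch could be extracted once a corrected, test‑passing version of a snippet exists. The diff‑based approach and the `line_level_patch` helper are illustrative assumptions, not the authors' actual patch‑minimization pipeline.

```python
# Minimal sketch (not the authors' pipeline): given an LLM-generated snippet and a
# corrected version that passes the test suite, extract which original lines the
# line-level patch touches. `line_level_patch` is a hypothetical helper name.
import difflib

def line_level_patch(generated: str, corrected: str) -> set[int]:
    """Return 0-based indices of lines in `generated` that the patch edits."""
    gen_lines = generated.splitlines()
    fix_lines = corrected.splitlines()
    edited = set()
    for tag, i1, i2, _j1, _j2 in difflib.SequenceMatcher(a=gen_lines, b=fix_lines).get_opcodes():
        if tag in ("replace", "delete"):
            edited.update(range(i1, i2))        # these original lines change or disappear
        elif tag == "insert":
            edited.add(i1)                      # attribute pure insertions to their position
    return edited

generated = "def add(a, b):\n    return a - b\n"
corrected = "def add(a, b):\n    return a + b\n"
print(line_level_patch(generated, corrected))   # {1}: only the return line needs editing
```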
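For step 2, the two metrics named above are standard. The sketch below computes ECE with equal‑width bins and a Brier Skill Score against a base‑rate reference forecast over per‑line probabilities p_i and binary edit labels y_i; these conventions are common choices, not necessarily the paper's exact configuration.

```python
# Standard implementations of ECE and Brier Skill Score over per-line probabilities p_i
# and binary labels y_i (1 = the line was edited in the minimal patch).
import numpy as np

def expected_calibration_error(p, y, n_bins: int = 10) -> float:
    p, y = np.asarray(p, float), np.asarray(y, float)
    bins = np.clip((p * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # weighted gap between mean confidence and empirical edit frequency in the bin
            ece += mask.mean() * abs(p[mask].mean() - y[mask].mean())
    return ece

def brier_skill_score(p, y) -> float:
    p, y = np.asarray(p, float), np.asarray(y, float)
    brier = np.mean((p - y) ** 2)
    brier_ref = np.mean((y.mean() - y) ** 2)    # baseline: always predict the base edit rate
    return 1.0 - brier / brier_ref              # > 0 means better-than-baseline calibration

print(expected_calibration_error([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]))
print(brier_skill_score([0.9, 0.2, 0.7, 0.1], [1, 0, 1, 0]))
```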
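For step 3, this is a hypothetical sketch of what an arbitrary‑span probe could look like: a small head over cached hidden states of the code LLM that mean‑pools the tokens inside a span and outputs an edit probability. The pooling scheme, head architecture, and dimensions are assumptions rather than the authors' exact probe.

```python
# Hypothetical span probe: mean-pool the target span's activations, then apply a small
# learned head to get an edit probability. Architecture details are assumed.
import torch
import torch.nn as nn

class SpanEditProbe(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(hidden_dim, 256), nn.GELU(), nn.Linear(256, 1))

    def forward(self, hidden_states: torch.Tensor, span: tuple) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_dim) activations from one layer of the code LLM,
        # computed once per program; span = (start_token, end_token), end-exclusive.
        start, end = span
        pooled = hidden_states[start:end].mean(dim=0)
        return torch.sigmoid(self.head(pooled))        # edit probability for the span

probe = SpanEditProbe(hidden_dim=4096)
activations = torch.randn(128, 4096)                   # cached activations for a 128-token program
print(probe(activations, span=(40, 55)).item())        # probability that tokens 40-54 get edited
```

Because the activations are cached once per program, any number of span queries can be answered without re‑running the large model, which is the efficiency argument behind the arbitrary‑span query.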
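For step 4, the self‑consistency baseline can be approximated as per‑line disagreement across several sampled completions; the exact‑match line comparison used below is a simplification of whatever alignment the paper uses.

```python
# Rough sketch of the self-consistency baseline: score each line of the first sample by
# how often the other samples disagree with it at the same position.
from collections import Counter

def per_line_disagreement(samples: list[str]) -> list[float]:
    reference = samples[0].splitlines()
    split_samples = [s.splitlines() for s in samples]
    scores = []
    for i, ref_line in enumerate(reference):
        votes = Counter(s[i] if i < len(s) else None for s in split_samples)
        scores.append(1.0 - votes[ref_line] / len(samples))   # high value = uncertain line
    return scores

samples = [
    "x = load()\ny = x.mean()\n",
    "x = load()\ny = x.sum()\n",
    "x = load()\ny = x.mean()\n",
]
print(per_line_disagreement(samples))   # [0.0, 0.33...]: the second line is the risky one
```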
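For step 5, one plausible reading of the "simple scaling adjustment" is a single temperature fit on a small labeled split of the target (natural‑language) domain, used to rescale the code‑trained probe's logits. The paper's actual adjustment may differ; the sketch below only illustrates the idea.

```python
# Fit a one-parameter temperature on held-out target-domain labels and use it to rescale
# the probe's logits. The data and parameter range here are hypothetical.
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels) -> float:
    logits, labels = np.asarray(logits, float), np.asarray(labels, float)

    def nll(t: float) -> float:
        p = np.clip(1.0 / (1.0 + np.exp(-logits / t)), 1e-6, 1 - 1e-6)
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

    return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x

# Hypothetical probe scores on natural-language spans, with observed error labels.
temperature = fit_temperature([2.1, -0.3, 1.5, -1.8], [1, 0, 1, 0])
print(temperature)
```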

Results & Findings

Method                               ECE ↓   Brier Skill Score ↑
White‑box probe (small supervisor)   0.07    0.20
Reflective prompting                 0.18    0.05
Self‑consistency variance            0.15    0.07
  • The probe’s calibration error is less than half that of the best black‑box baseline (0.07 vs. 0.15).
  • Even though the supervisor model is orders of magnitude smaller than the target LLM, it can reliably predict which lines will be edited.
  • When the probe’s output is linearly rescaled, it modestly flags grammatical errors in plain English, suggesting the learned signal captures a more general notion of “uncertainty”.

Practical Implications

  • IDE Integration: Plug the probe into code editors to highlight “high‑risk” lines in real time, letting developers focus their review where it matters most.
  • Automated Refactoring Tools: Use calibrated edit probabilities to prioritize automated fixes or suggest targeted test generation.
  • Continuous Integration (CI): CI pipelines could automatically flag generated code that exceeds a calibrated risk threshold, prompting human review before merge (a toy example follows this list).
  • Model‑agnostic Oversight: Because the probe works with internal activations, it can be attached to any future, larger code LLM without retraining the whole model, offering a lightweight safety layer.
  • Cost Savings: By narrowing the debugging surface, teams can reduce the time spent on post‑generation edits, especially in large‑scale code‑generation workflows (e.g., scaffolding microservices, data‑pipeline scripts).
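
As an illustration of the CI idea above, a pipeline step might consume per‑line edit probabilities from the probe and request human review when any line crosses a calibrated threshold. The threshold value and report format below are hypothetical.

```python
# Illustrative only: fail a CI check when any generated line exceeds a calibrated risk
# threshold, so a human reviews it before merge.
RISK_THRESHOLD = 0.5    # tune on a validation split so the flag rate matches review capacity

def review_report(lines: list[str], edit_probs: list[float], threshold: float = RISK_THRESHOLD) -> bool:
    """Print flagged lines and return True when the snippet is safe to merge unreviewed."""
    flagged = [(i + 1, p, line) for i, (line, p) in enumerate(zip(lines, edit_probs)) if p >= threshold]
    for lineno, p, line in flagged:
        print(f"line {lineno} (p_edit={p:.2f}): {line}")
    return not flagged

safe = review_report(["def add(a, b):", "    return a - b"], [0.05, 0.81])
print("needs human review:", not safe)
```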

Limitations & Future Work

  • Dataset Scope: The patches are derived from relatively small, self‑contained programming tasks; scaling to large, multi‑file projects may expose new failure modes.
  • Probe Generalization: While early results hint at cross‑domain transfer, the probe still requires task‑specific scaling to handle natural‑language errors effectively.
  • Dependency on Test Suites: The notion of “minimal edit” hinges on the quality and completeness of the provided tests; flaky or under‑specified tests could bias calibration.
  • Black‑Box Alternatives: More sophisticated ensemble or Bayesian approaches might close the gap further, a direction the authors suggest exploring.

Bottom line: This work offers a practical, calibrated uncertainty signal for code‑generating LLMs, turning the “black‑box” output into a more actionable artifact for developers and tooling ecosystems.

Authors

  • David Gros
  • Prem Devanbu

Paper Information

  • arXiv ID: 2512.24560v1
  • Categories: cs.SE, cs.AI
  • Published: December 31, 2025