[Paper] Localized Calibrated Uncertainty in Code Language Models
Source: arXiv - 2512.24560v1
Overview
Large language models (LLMs) can now turn natural‑language prompts into functional source code, but the generated snippets often need tweaking before they truly match a developer’s intent. This paper introduces a way to pinpoint exactly which lines of generated code are likely to need edits, giving developers a calibrated confidence score for each region of the output.
Key Contributions
- Minimal Intent‑Aligning Patch dataset – a curated collection of LLM‑generated programs together with the smallest line‑level edits (patches) that make them pass a suite of test cases.
- Localized calibration framework – methods to assign well‑calibrated probabilities to individual code spans, indicating the odds they will be edited in the minimal patch.
- White‑box probing technique – an efficient “arbitrary‑span query” probe that leverages a small supervisor model to estimate edit probabilities across any contiguous block of code.
- Benchmark against black‑box baselines – reflective prompting and self‑consistency approaches are evaluated, showing the probe’s superior calibration (Brier Skill Score ≈ 0.2).
- Cross‑domain generalization hint – a probe trained solely on code errors exhibits some ability to flag natural‑language errors when a simple probability scaling is applied.
Methodology
Dataset Construction
- Generate code snippets using several state‑of‑the‑art LLMs (e.g., GPT‑4, Claude).
- Run each snippet against a set of unit tests; when failures occur, compute the minimal line‑level edit that makes the program pass.
- Store the original snippet, the test suite, and the minimal patch; together these records form the Minimal Intent‑Aligning Patch dataset (a record‑construction sketch follows below).
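Below is a minimal sketch of what one dataset record might look like, assuming the pipeline pairs each generated program with its repaired, test‑passing version and derives edited line indices from a line‑level diff. The names PatchRecord and make_patch_record are illustrative, not taken from the paper.

```python
# Minimal sketch (not the authors' exact pipeline): pair a generated program with its
# repaired, test-passing version and record which line indices were edited.
# PatchRecord and make_patch_record are illustrative names.
import difflib
from dataclasses import dataclass, field


@dataclass
class PatchRecord:
    generated_code: str   # original LLM output
    repaired_code: str    # output after the minimal intent-aligning patch
    test_suite: str       # unit tests that define "intent"
    edited_lines: list = field(default_factory=list)  # 0-based indices of edited lines


def make_patch_record(generated_code: str, repaired_code: str, test_suite: str) -> PatchRecord:
    gen_lines = generated_code.splitlines()
    fix_lines = repaired_code.splitlines()
    edited = []
    matcher = difflib.SequenceMatcher(a=gen_lines, b=fix_lines)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            edited.extend(range(i1, i2))          # lines changed or removed by the patch
        elif tag == "insert" and i1 < len(gen_lines):
            edited.append(i1)                     # mark the line at the insertion point
    return PatchRecord(generated_code, repaired_code, test_suite, sorted(set(edited)))
```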
Calibration Objective
- For any line (or span) i in a generated program, the goal is a probability p_i such that, among spans assigned probability p_i, the fraction actually edited in the minimal patch is approximately p_i.
- Calibration is measured with Expected Calibration Error (ECE) and Brier Skill Score (BSS); a computation sketch for both follows below.
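Both metrics have standard definitions; the sketch below computes them for binary "will this span be edited?" labels. The 10‑bin ECE and the base‑rate reference forecast in the BSS are conventional choices, not necessarily those used in the paper.

```python
# Minimal sketch of the two calibration metrics for binary "will this span be edited?"
# labels. The 10-bin ECE and the base-rate reference forecast in the BSS are
# conventional choices, not necessarily those used in the paper.
import numpy as np


def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (probs >= lo) & ((probs < hi) if hi < 1.0 else (probs <= hi))
        if in_bin.any():
            gap = abs(probs[in_bin].mean() - labels[in_bin].mean())
            ece += in_bin.mean() * gap            # weight the gap by the bin's sample share
    return float(ece)


def brier_skill_score(probs: np.ndarray, labels: np.ndarray) -> float:
    brier = np.mean((probs - labels) ** 2)
    base_rate = labels.mean()                     # reference: always predict the edit base rate
    brier_ref = np.mean((base_rate - labels) ** 2)
    return float(1.0 - brier / brier_ref)
```

Under this definition, a BSS of ≈ 0.2 means the probe's Brier score is about 20% lower than that of always predicting the dataset's overall edit rate.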
Probing Approach (White‑Box)
- Train a lightweight “supervisor” model (e.g., a 1‑billion‑parameter transformer) to predict edit probabilities from the internal activations of the large code LLM.
- Implement an arbitrary‑span query: the probe efficiently aggregates token‑level signals into a probability for any contiguous block without re‑running the whole model (see the sketch below).
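The summary does not pin down the probe architecture or the span‑aggregation rule, so the sketch below is one plausible design: a small feed‑forward head scores each token from the target LLM's frozen hidden states, and a noisy‑OR over the queried span yields the span‑level edit probability. SpanEditProbe and its layer sizes are assumptions, not the paper's architecture.

```python
# Sketch of an activation probe that answers arbitrary-span queries, assuming access to
# per-token hidden states from the target code LLM. The architecture and the noisy-OR
# span aggregation are plausible choices, not necessarily the paper's.
import torch
import torch.nn as nn


class SpanEditProbe(nn.Module):
    def __init__(self, hidden_dim: int, probe_dim: int = 256):
        super().__init__()
        # Small supervisor head trained on top of frozen LLM activations.
        self.token_scorer = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim),
            nn.GELU(),
            nn.Linear(probe_dim, 1),
        )

    def token_probs(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (seq_len, hidden_dim) activations for one generated program.
        return torch.sigmoid(self.token_scorer(hidden_states)).squeeze(-1)

    def span_prob(self, hidden_states: torch.Tensor, start: int, end: int) -> torch.Tensor:
        # Probability that some token in [start, end) is edited: noisy-OR over token scores.
        p = self.token_probs(hidden_states)[start:end]
        return 1.0 - torch.prod(1.0 - p)
```

Noisy‑OR treats each token as an independent trigger for an edit, so the span probability grows monotonically with span length; mean‑ or max‑pooling over token scores would be equally reasonable aggregations.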
Black‑Box Baselines
- Reflective prompting: ask the LLM to self‑diagnose its output (“Which lines might be wrong?”).
- Self‑consistency: generate multiple samples and compute the per‑line variance (disagreement) across them as a proxy for uncertainty (see the sketch below).
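A minimal sketch of such a disagreement proxy, assuming lines are compared after whitespace normalization; the paper's exact measure may differ.

```python
# Sketch of a self-consistency proxy: score each line of the primary sample by how often
# the same (whitespace-normalized) line appears in alternative samples. The paper's exact
# disagreement measure may differ.
def line_uncertainty(primary: str, alternatives: list[str]) -> list[float]:
    alt_line_sets = [
        {line.strip() for line in alt.splitlines() if line.strip()}
        for alt in alternatives
    ]
    scores = []
    for line in primary.splitlines():
        key = line.strip()
        if not key or not alt_line_sets:
            scores.append(0.0)                    # blank lines (or no alternatives) carry no signal
            continue
        agreement = sum(key in s for s in alt_line_sets) / len(alt_line_sets)
        scores.append(1.0 - agreement)            # high disagreement => higher suspected-edit score
    return scores
```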
Evaluation
- Compare calibration error and BSS across probes and baselines on a held‑out test set of patches.
- Test cross‑domain transfer by applying the code‑trained probe to natural‑language error‑detection tasks with a simple probability‑scaling adjustment (see the sketch below).
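A minimal sketch of one such scaling, assuming an affine rescaling p' = a*p + b fit by least squares on a small labeled calibration set from the new domain; the paper's exact adjustment may differ.

```python
# Sketch of a "simple scaling" adjustment for cross-domain transfer: fit p' = a*p + b by
# least squares (which minimizes the Brier score over affine maps) on a small labeled
# calibration set from the new domain, then clip to [0, 1]. The paper's exact rescaling
# may differ.
import numpy as np


def fit_affine_rescale(probs: np.ndarray, labels: np.ndarray) -> tuple[float, float]:
    X = np.column_stack([probs, np.ones_like(probs)])
    (a, b), *_ = np.linalg.lstsq(X, labels.astype(float), rcond=None)
    return float(a), float(b)


def apply_rescale(probs: np.ndarray, a: float, b: float) -> np.ndarray:
    return np.clip(a * probs + b, 0.0, 1.0)
```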
Results & Findings
| Method | ECE ↓ | Brier Skill Score ↑ |
|---|---|---|
| White‑box probe (small supervisor) | 0.07 | 0.20 |
| Reflective prompting | 0.18 | 0.05 |
| Self‑consistency variance | 0.15 | 0.07 |
- The probe’s calibration error is less than half that of either black‑box baseline.
- Even though the supervisor model is orders of magnitude smaller than the target LLM, it can reliably predict which lines will be edited.
- When the probe’s output is linearly rescaled, it modestly flags grammatical errors in plain English, suggesting the learned signal captures a more general notion of “uncertainty”.
Practical Implications
- IDE Integration: Plug the probe into code editors to highlight “high‑risk” lines in real time, letting developers focus their review where it matters most.
- Automated Refactoring Tools: Use calibrated edit probabilities to prioritize automated fixes or suggest targeted test generation.
- Continuous Integration (CI): CI pipelines could automatically flag generated code that exceeds a calibrated risk threshold, prompting human review before merge (a gating sketch follows this list).
- Model‑agnostic Oversight: Because the probe works from internal activations, it can be attached to a future, larger code LLM by training only the lightweight supervisor; the target model itself is never fine‑tuned, offering a low‑cost safety layer.
- Cost Savings: By narrowing the debugging surface, teams can reduce the time spent on post‑generation edits, especially in large‑scale code‑generation workflows (e.g., scaffolding microservices, data‑pipeline scripts).
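As an illustration of the CI gate mentioned above, a check over calibrated per‑line edit probabilities could be as simple as the sketch below; both thresholds are arbitrary policy knobs, not values from the paper.

```python
# Illustrative CI gate over calibrated per-line edit probabilities; both thresholds are
# policy knobs chosen for the example, not values from the paper.
def needs_human_review(line_probs: list[float],
                       line_threshold: float = 0.5,
                       expected_edit_budget: float = 1.0) -> bool:
    any_risky_line = any(p >= line_threshold for p in line_probs)
    expected_edits = sum(line_probs)              # calibrated probabilities sum to expected edits
    return any_risky_line or expected_edits > expected_edit_budget
```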
Limitations & Future Work
- Dataset Scope: The patches are derived from relatively small, self‑contained programming tasks; scaling to large, multi‑file projects may expose new failure modes.
- Probe Generalization: While early results hint at cross‑domain transfer, the probe still requires task‑specific scaling to handle natural‑language errors effectively.
- Dependency on Test Suites: The notion of “minimal edit” hinges on the quality and completeness of the provided tests; flaky or under‑specified tests could bias calibration.
- Black‑Box Alternatives: More sophisticated ensemble or Bayesian approaches might close the gap further, a direction the authors suggest exploring.
Bottom line: This work offers a practical, calibrated uncertainty signal for code‑generating LLMs, turning the “black‑box” output into a more actionable artifact for developers and tooling ecosystems.
Authors
- David Gros
- Prem Devanbu
Paper Information
- arXiv ID: 2512.24560v1
- Categories: cs.SE, cs.AI
- Published: December 31, 2025