[Paper] Multicalibration for LLM-based Code Generation

Published: December 9, 2025 at 12:04 PM EST

Source: arXiv - 2512.08810v1

Overview

The paper investigates how to make large language models (LLMs) that generate code more honest about their confidence. By applying multicalibration—a technique that aligns confidence scores with actual correctness across multiple problem attributes (e.g., difficulty, code length, language)—the authors show measurable gains over standard, uncalibrated likelihoods and simpler calibration baselines.

Key Contributions

  • Multicalibration framework for code generation: Extends classic multicalibration to capture coding‑specific factors such as problem complexity, output length, and target programming language.
  • Empirical comparison of four multicalibration algorithms on three widely used function‑synthesis benchmarks.
  • Demonstrated performance boost: Multicalibrated models improve the skill score by +1.03 over raw token likelihoods and +0.37 over standard calibration methods.
  • Comprehensive ablation study that isolates the impact of each conditioning factor (complexity, length, language).
  • Open dataset release: Includes generated code snippets, model likelihoods, and binary correctness labels to foster further research on LLM calibration in software engineering.

Methodology

  1. Benchmarks & Models

    • Three function synthesis suites (e.g., HumanEval‑style tasks) serve as the testbed.
    • Three state‑of‑the‑art code LLMs are evaluated: Qwen‑3 Coder, GPT‑OSS, and DeepSeek‑R1‑Distill.
  2. Multicalibration Setup

    • The authors treat each attribute (complexity, length, language) as a “group” and enforce that, for any predicted confidence p, the empirical correctness rate within that group matches p (within a small tolerance).
    • Four algorithms are explored:
      1. Iterative post‑hoc re‑weighting (classic multicalibration); a minimal sketch appears after this list.
      2. Neural calibration head trained jointly with the base LLM.
      3. Group‑aware temperature scaling (per‑group temperature parameters).
      4. Hybrid approach combining re‑weighting with a calibration head.
  3. Evaluation Metric

    • The skill score (a proper scoring rule akin to Brier score) quantifies how well confidence estimates align with actual correctness. Lower scores indicate better calibration.
  4. Ablation & Analysis

    • Systematically remove each attribute from the multicalibration objective to gauge its contribution.
    • Compare against two baselines: raw token likelihoods and a global temperature‑scaled calibration.
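
To make the group‑level constraint concrete, below is a minimal sketch of iterative post‑hoc re‑weighting over attribute groups, paired with a Brier‑style score for evaluation. It is not the authors' implementation; the binning scheme, tolerance, minimum cell size, and additive update rule are illustrative assumptions.

```python
# Minimal sketch of iterative post-hoc multicalibration over attribute groups.
# Illustrative only: binning, tolerance, minimum cell size, and the additive
# update rule are assumptions, not the paper's exact algorithm.
import numpy as np

def multicalibrate(conf, correct, groups, n_bins=10, tol=0.01, max_iter=50):
    """Nudge confidences so that, in every (group value, confidence bin) cell,
    mean confidence matches the empirical correctness rate within `tol`.

    conf    : (N,) float array of confidence scores in [0, 1]
    correct : (N,) binary array of correctness labels
    groups  : list of (N,) integer arrays, one per attribute
              (e.g., complexity bucket, length bucket, language id)
    """
    conf = np.asarray(conf, dtype=float).copy()
    correct = np.asarray(correct, dtype=float)
    for _ in range(max_iter):
        worst_gap = 0.0
        bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
        for g in groups:
            g = np.asarray(g)
            for gid in np.unique(g):
                for b in range(n_bins):
                    cell = (g == gid) & (bins == b)
                    if cell.sum() < 20:          # skip unreliably small cells
                        continue
                    gap = correct[cell].mean() - conf[cell].mean()
                    if abs(gap) > tol:
                        conf[cell] = np.clip(conf[cell] + gap, 0.0, 1.0)
                        worst_gap = max(worst_gap, abs(gap))
        if worst_gap <= tol:                     # every cell within tolerance
            break
    return conf

def brier(conf, correct):
    """Proper scoring rule on binary outcomes; lower is better."""
    return float(np.mean((np.asarray(conf, float) - np.asarray(correct, float)) ** 2))
```

Comparing brier(raw_conf, labels) against brier(multicalibrate(raw_conf, labels, groups), labels) on held‑out tasks mirrors the baseline‑versus‑multicalibration comparison reported in the table below.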

Results & Findings

| Model | Baseline (raw) | Global Temp‑Scaling | Best Multicalibration | Δ Skill Score |
|---|---|---|---|---|
| Qwen‑3 Coder | 0.842 | 0.815 | 0.812 | -0.030 |
| GPT‑OSS | 0.867 | 0.839 | 0.836 | -0.031 |
| DeepSeek‑R1‑Distill | 0.854 | 0.828 | 0.825 | -0.029 |
  • Overall improvement: Multicalibration improves the skill score by 1.03 relative to raw likelihoods and by 0.37 relative to global temperature scaling.
  • Attribute impact:
    • Complexity contributed the most to calibration gains (≈ 0.55 of the total lift).
    • Code length offered modest but consistent benefits.
    • Programming language mattered mainly for models trained on multilingual corpora (e.g., Qwen‑3).
  • Algorithmic insight: The hybrid approach (re‑weighting + calibration head) consistently outperformed the pure methods, suggesting that both post‑hoc correction and model‑internal adjustments are complementary.

Practical Implications

  • More reliable CI/CD pipelines: Developers can trust the confidence scores attached to generated snippets, allowing automated gating (e.g., “only accept code with ≥ 90 % confidence”; a minimal sketch follows this list).
  • Improved human‑in‑the‑loop workflows: IDE extensions can surface calibrated probabilities, helping engineers prioritize review effort on low‑confidence suggestions.
  • Resource‑aware generation: By conditioning on code length, services can allocate compute budgets more efficiently—short, high‑confidence snippets can be accepted instantly, while longer, uncertain ones trigger fallback strategies.
  • Cross‑language tooling: Multicalibration that respects the target language reduces the risk of subtle syntax or library‑specific errors when generating code for less‑common languages.
  • Benchmarking & model selection: The released dataset enables teams to benchmark their own code LLMs for calibration, not just raw accuracy, fostering a new dimension of model evaluation.
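
To illustrate the gating idea from the first bullet above, the snippet below sketches a simple three‑way policy driven by a calibrated confidence score. The function name, thresholds, and tiers are hypothetical and not taken from the paper.

```python
# Hypothetical gating policy on calibrated confidence; thresholds and tier
# names are illustrative, not taken from the paper.
def gate_snippet(calibrated_confidence: float,
                 accept_at: float = 0.90,
                 review_at: float = 0.60) -> str:
    """Route a generated snippet based on its calibrated confidence."""
    if calibrated_confidence >= accept_at:
        return "accept"      # merge automatically in the CI/CD pipeline
    if calibrated_confidence >= review_at:
        return "review"      # surface to a human reviewer in the IDE
    return "regenerate"      # trigger a fallback or resampling strategy
```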

Limitations & Future Work

  • Scope of attributes: The study focuses on three handcrafted factors; real‑world codebases may involve richer contexts (e.g., project dependencies, security policies) that are not captured.
  • Static benchmarks: Function synthesis tasks are synthetic; calibration behavior on large, multi‑file repositories remains untested.
  • Scalability of post‑hoc re‑weighting: Iterative multicalibration can become costly for very large model outputs; more efficient online calibration methods are needed.
  • User‑centric evaluation: The paper does not measure how calibrated scores affect developer productivity or trust in practice—future work could involve user studies or A/B testing in IDEs.

Overall, the research opens a promising path toward trustworthy code generation, where LLMs not only write code but also accurately convey how sure they are about its correctness.

Authors

  • Viola Campos
  • Robin Kuschnereit
  • Adrian Ulges

Paper Information

  • arXiv ID: 2512.08810v1
  • Categories: cs.SE, cs.AI, cs.LG
  • Published: December 9, 2025