[Paper] Do LLMs Trust the Code They Write?

Published: December 8, 2025 at 05:38 AM EST
3 min read
Source: arXiv - 2512.07404v1

Overview

Large language models (LLMs) have become surprisingly good at writing code, but they still produce buggy snippets far too often. This paper asks a simple yet profound question: Do LLMs “know” when the code they generate is correct? By probing the hidden states of four popular code‑generation models, the authors uncover an internal “correctness signal” that can be used to rank and filter generated programs—without running any tests.

Key Contributions

  • Discovery of an internal correctness representation – Demonstrates that hidden activations differ systematically between correct and incorrect solutions for the same task.
  • Contrastive probing technique – Introduces a lightweight method that extracts the correctness signal by comparing paired correct/incorrect code examples.
  • Improved ranking over log‑likelihood – Shows that using the extracted signal outperforms traditional probability‑based ranking and even verbalized confidence outputs.
  • Test‑free quality selection – Provides a practical way to pick higher‑quality code samples without executing them, reducing reliance on expensive test suites.
  • Cross‑model validation – Experiments on four LLMs (including open‑source and closed‑source variants) confirm that the phenomenon is not model‑specific.

Methodology

  1. Dataset construction – For each programming task (e.g., “reverse a string”), the authors collected multiple generated solutions and labeled them as correct or incorrect using automated test suites.
  2. Hidden‑state extraction – While the model generated each solution, they recorded the final hidden vector (the representation just before the softmax output).
  3. Contrastive probing – They trained a simple linear classifier (logistic regression) to separate the hidden vectors of correct vs. incorrect solutions, using paired examples to focus on the difference rather than on absolute values (a minimal sketch of steps 2–3 appears after this list).
  4. Ranking strategies – At inference time, the classifier’s confidence score is used to reorder a beam of generated snippets, replacing the usual log‑likelihood ranking.
  5. Evaluation – They measured how often the top‑ranked snippet was actually correct, comparing against baselines (raw probability, model‑generated “I’m confident” statements, and oracle test execution).
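The snippet below is a minimal sketch of steps 2–3, not the authors' code: it assumes a Hugging Face causal code LLM and a set of snippets already labeled by running unit tests. The model name, the choice of the last‑token activation, and the probe hyperparameters are illustrative assumptions.

```python
# Sketch of hidden-state extraction + linear probe (NOT the authors' exact setup).
# Assumes: a Hugging Face causal code LLM, and a list of (code, passed_tests) pairs
# labeled by executing a test suite offline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

MODEL_NAME = "bigcode/starcoderbase-1b"  # placeholder; any open code LLM would do

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def final_hidden_state(code: str) -> torch.Tensor:
    """Return the last layer's activation at the final token of the snippet."""
    inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1][0, -1, :]  # shape: (hidden_dim,)

def train_probe(labeled_snippets):
    """labeled_snippets: list of (code_string, passed_tests: bool)."""
    X = torch.stack([final_hidden_state(c) for c, _ in labeled_snippets]).numpy()
    y = [int(ok) for _, ok in labeled_snippets]
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    print("train AUC:", roc_auc_score(y, probe.predict_proba(X)[:, 1]))
    return probe
```

In practice the probe would be trained and evaluated on disjoint tasks, so that it learns a general correctness signal rather than memorizing individual problems.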

Results & Findings

Model          | Baseline (log‑likelihood) Top‑1 Accuracy | Correctness‑signal Ranking | Gain
CodeGen‑6B     | 58%                                      | 71%                        | +13 pts
StarCoder‑15B  | 62%                                      | 76%                        | +14 pts
GPT‑3.5‑Codex  | 65%                                      | 78%                        | +13 pts
Claude‑2       | 68%                                      | 81%                        | +13 pts
  • The linear probe achieved AUC ≈ 0.88 across models, indicating strong separability of correct vs. incorrect hidden states.
  • Verbalized confidence (“I think this is correct”) was only marginally better than raw probabilities (≈2‑3% gain), underscoring the value of the hidden‑state signal.
  • Using the signal to filter a beam of 10 candidates reduced the need for test execution by ≈40% while preserving the same overall correctness rate (see the sketch just below).
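As a rough illustration of that filtering step, the sketch below reuses `probe` and `final_hidden_state` from the methodology sketch; the 0.5 threshold is an illustrative assumption, not the paper's setting.

```python
# Sketch: prune a beam of candidates with the probe's score, so only the most
# promising snippets are sent to the (expensive) test suite. Threshold is illustrative.
def filter_beam(candidates, probe, threshold=0.5):
    scored = []
    for code in candidates:
        h = final_hidden_state(code).numpy().reshape(1, -1)
        p_correct = probe.predict_proba(h)[0, 1]
        scored.append((p_correct, code))
    # Highest estimated correctness first.
    scored.sort(key=lambda t: t[0], reverse=True)
    # Keep only candidates the probe considers likely correct; run tests on these.
    return [code for p, code in scored if p >= threshold]
```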

Practical Implications

  • Developer tooling – IDE plugins could automatically surface the “most likely correct” suggestion from a model’s beam, reducing the time spent debugging AI‑generated snippets.
  • CI/CD pipelines – Instead of running a full test suite on every generated patch, a cheap correctness‑score filter can prune obviously buggy candidates, saving compute resources.
  • Model‑as‑a‑service – Providers can expose a “confidence API” derived from hidden states, giving customers a quantitative reliability metric without exposing the model internals (a toy endpoint sketch follows this list).
  • Safety & security – Early detection of incorrect (and potentially vulnerable) code before execution mitigates risks in automated code synthesis for critical systems.
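To make the model‑as‑a‑service idea concrete, here is a deliberately toy sketch of what such a confidence endpoint could look like, reusing the probe from the earlier sketch. The framework choice (FastAPI), route name, and response schema are all hypothetical, not anything described in the paper.

```python
# Hypothetical "confidence API": the provider returns only a scalar reliability
# score, so hidden states never leave the server. Route name and schema are
# invented for illustration; `probe` and `final_hidden_state` come from the
# methodology sketch above.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class CodeRequest(BaseModel):
    code: str

@app.post("/v1/code-confidence")
def code_confidence(req: CodeRequest) -> dict:
    h = final_hidden_state(req.code).numpy().reshape(1, -1)
    return {"p_correct": float(probe.predict_proba(h)[0, 1])}
```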

Limitations & Future Work

  • Task scope – Experiments focused on relatively small, self‑contained programming problems; scaling the approach to large, multi‑file projects remains open.
  • Label dependence – The probe requires a labeled set of correct/incorrect examples, which may be costly to obtain for niche languages or domains.
  • Model‑specific nuances – While the signal appears across several models, its strength varies; future work should explore architecture‑agnostic probing methods.
  • Correctness criteria – The current signal captures functional correctness (passing unit tests); extending it to performance, security, or style criteria is an exciting direction.

Bottom line: By peeking inside LLMs, the authors show that these models already carry a hidden sense of “right vs. wrong” code. Harnessing that sense can make AI‑generated code more trustworthy—without waiting for the tests to run.

Authors

  • Francisco Ribeiro
  • Claudio Spiess
  • Prem Devanbu
  • Sarah Nadi

Paper Information

  • arXiv ID: 2512.07404v1
  • Categories: cs.SE, cs.AI, cs.LG
  • Published: December 8, 2025