[Paper] CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute

Published: February 9, 2026 at 12:44 PM EST
5 min read
Source: arXiv - 2602.08948v1

Overview

Large Language Models (LLMs) achieve higher reasoning accuracy by generating many parallel answer candidates, often hundreds of independently sampled traces, but this “test‑time scaling” burns massive compute. CoRefine proposes a lightweight, confidence‑driven controller that sits on top of a frozen LLM and decides, on the fly, whether to accept an answer, request a re‑examination, or try a different reasoning path. The result: comparable or better performance while using roughly 0.5 % of the tokens that traditional 512‑sample decoding requires.

Key Contributions

  • Confidence‑Guided Controller – a 211k‑parameter Conv1D module that consumes the LLM’s full‑trace confidence scores and issues one of three actions: halt, refine, or switch strategy (a minimal sketch follows this list).
  • Massive Token Savings – on average only 2.7 refinement steps per problem, translating to roughly a 190× reduction in token usage versus a 512‑sample baseline.
  • High‑Precision Halting – when the controller confidently halts, the answer is correct 92.6 % of the time, indicating that confidence dynamics are a reliable proxy for answer quality.
  • CoRefine‑Tree Extension – a hybrid sequential‑parallel variant that dynamically balances exploration (sampling more candidates) and exploitation (deepening a promising trace), making it easy to plug into existing serving stacks and verifier pipelines.
  • Model‑Agnostic Design – demonstrated across three open‑source LLMs (LLaMA‑2‑7B, Falcon‑7B, and Mistral‑7B), showing that the approach works without fine‑tuning the underlying language model.
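
To make the controller concrete, here is a minimal PyTorch sketch of a shallow Conv1D policy over a per‑token confidence trace. The layer widths, kernel size, and class name are illustrative assumptions; the paper specifies the ~211k parameter count and the Conv1D design, not these exact hyperparameters.

```python
import torch
import torch.nn as nn

# Assumed encoding of the paper's three actions.
HALT, REFINE, SWITCH = 0, 1, 2

class ConfidenceController(nn.Module):
    """Shallow Conv1D policy over a per-token confidence series.

    Hypothetical architecture: the paper reports ~211k parameters and a
    Conv1D design; the widths and kernel below are illustrative guesses.
    """

    def __init__(self, hidden: int = 200, kernel: int = 5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over the variable trace length
            nn.Flatten(),
            nn.Linear(hidden, 3),     # logits for halt / refine / switch
        )

    def forward(self, confidences: torch.Tensor) -> torch.Tensor:
        # confidences: (batch, trace_len) of per-token softmax probabilities
        return self.net(confidences.unsqueeze(1))

controller = ConfidenceController()
print(sum(p.numel() for p in controller.parameters()))  # ~202k at these sizes
```

A network this small adds negligible cost per decision, which is what keeps the control overhead essentially free relative to LLM inference.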

Methodology

  1. Freeze the LLM – The base model is left untouched; all inference happens exactly as in a standard deployment.
  2. Generate a Full‑Trace Confidence Vector – As the LLM produces each token, its built‑in confidence (the softmax probability of the emitted token) is recorded, yielding a time‑series of confidence values for the whole answer (see the extraction sketch just after this list).
  3. Pass Through Conv1D Controller – The confidence vector is fed into a shallow 1‑D convolutional network (211k parameters). The network learns patterns, such as “confidence spikes, then drops”, that historically precede errors.
  4. Action Decision
    • Halt – If confidence dynamics indicate a high‑certainty correct answer, stop and return the current trace.
    • Refine – If confidence degrades, request the LLM to continue generating a corrective continuation (self‑refinement).
    • Switch – If the trace appears stuck, trigger a new sampling strategy (e.g., different temperature or prompt tweak).
  5. Iterate – The controller repeats the above until a halt decision is made or a maximum refinement budget is reached (a sketch of this loop appears after the paragraph below).
  6. CoRefine‑Tree – For more demanding tasks, a tree‑like search is built where each node is a refinement step; the controller decides whether to expand a node (explore) or descend deeper (exploit), effectively blending parallel sampling with sequential refinement.
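
As a concrete illustration of step 2, the per‑token confidence series can be pulled from a Hugging Face transformers model via output_scores and compute_transition_scores. The checkpoint name and generation settings below are placeholders, and this extraction code is an assumption, not taken from the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; any causal LM that exposes generation scores works.
name = "meta-llama/Llama-2-7b-hf"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Q: What is 17 * 24? A:", return_tensors="pt")
out = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    return_dict_in_generate=True,
    output_scores=True,  # keep per-step logits for the confidence trace
)

# Log-probability of each generated token, exponentiated to get the
# per-token softmax confidences that the controller consumes.
logprobs = model.compute_transition_scores(
    out.sequences, out.scores, normalize_logits=True
)
confidences = logprobs[0].exp()  # shape: (num_generated_tokens,)
```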

All of this runs as a thin wrapper around the LLM, requiring only a single forward pass through the controller per refinement step, which is negligible compared to the LLM’s inference cost.
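
Putting steps 4–5 together, a hedged sketch of the outer control loop follows. Here generate_with_confidences is a hypothetical wrapper around the extraction code above, the action constants match the controller sketch, and the refinement and switch prompts are illustrative rather than the paper’s exact prompts.

```python
HALT, REFINE, SWITCH = 0, 1, 2  # assumed action encoding, as above
MAX_STEPS = 6                   # the paper reports at most 6 refinement steps

def corefine(llm, prompt, controller):
    # generate_with_confidences: hypothetical helper returning
    # (answer_text, per-token confidence tensor); see the sketch above.
    trace, conf = generate_with_confidences(llm, prompt)
    for _ in range(MAX_STEPS):
        action = controller(conf.unsqueeze(0)).argmax(dim=-1).item()
        if action == HALT:
            return trace  # high-certainty answer: stop and return it
        if action == REFINE:
            # Ask the frozen LLM for a corrective continuation.
            trace, conf = generate_with_confidences(
                llm, f"{prompt}\n{trace}\nRe-examine the answer above and fix any errors:"
            )
        else:  # SWITCH: abandon the stuck trace and resample differently
            trace, conf = generate_with_confidences(llm, prompt, temperature=1.0)
    return trace  # refinement budget exhausted
```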

Results & Findings

| Benchmark            | Baseline (512 samples) | CoRefine (avg. 2.7 steps) | Token Reduction | Accuracy Δ |
|----------------------|------------------------|---------------------------|-----------------|------------|
| GSM‑8K (arithmetic)  | 78.4 %                 | 77.9 %                    | ~190×           | −0.5 %     |
| MATH (college‑level) | 45.2 %                 | 45.0 %                    | ~190×           | −0.2 %     |
| ARC‑Easy             | 84.1 %                 | 84.3 %                    | ~190×           | +0.2 %     |
| TruthfulQA           | 62.5 %                 | 62.2 %                    | ~190×           | −0.3 %     |
  • Precision of confident halts: when the controller decides to stop, the returned answer is correct 92.6 % of the time.
  • Average refinement steps: 2.7 per problem, with a maximum of 6 in the worst case.
  • Latency impact: End‑to‑end response time dropped by ~30 % on a typical GPU server because fewer tokens need to be processed.
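
For intuition on the headline reduction: replacing 512 independent samples with a single trace plus an average of 2.7 refinement passes works out to roughly 512 / 2.7 ≈ 190× fewer generated tokens, matching the reported figure (assuming each refinement pass costs about as much as one full sample).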

These numbers show that CoRefine can match or slightly improve accuracy while slashing compute and latency dramatically.

Practical Implications

  1. Cost‑Effective Scaling – Deployers can keep the same LLM checkpoint but cut inference bills by an order of magnitude, especially for high‑throughput services (e.g., code‑assistant APIs, tutoring bots).
  2. Dynamic Resource Allocation – The controller’s “refine vs. halt” signal can be hooked into autoscaling policies: spin up extra GPU instances only when many refinements are triggered.
  3. Agentic Systems & Verifiers – In multi‑step AI agents where a verifier decides whether to accept a sub‑task result, CoRefine can act as a pre‑verifier, reducing the verifier’s workload and improving overall pipeline reliability.
  4. Plug‑and‑Play Integration – Because the LLM stays frozen, existing production stacks (Hugging Face pipelines, LangChain, OpenAI‑compatible endpoints) can adopt CoRefine by adding a thin inference wrapper—no retraining or model conversion needed.
  5. Environmental Impact – Fewer token generations translate directly into lower energy consumption, aligning large‑scale LLM deployments with sustainability goals.

Limitations & Future Work

  • Reliance on Confidence Signals – The approach assumes that the LLM’s internal softmax confidence correlates with correctness; for models with poorly calibrated probabilities, halting precision may drop.
  • Controller Generalization – While tested on three open‑source models, the Conv1D controller may need re‑training or fine‑tuning for newer architectures (e.g., transformer‑based vision‑language models).
  • Edge Cases – Extremely ambiguous or multi‑modal questions sometimes trigger the maximum refinement budget without reaching a confident halt, leading to timeouts.
  • Future Directions
    • Incorporate calibration techniques (temperature scaling, isotonic regression) to improve confidence reliability (a standard temperature‑scaling sketch follows this list).
    • Explore meta‑learning where the controller adapts online to a specific user’s query distribution.
    • Extend CoRefine‑Tree to distributed inference across multiple nodes, enabling even larger parallel‑exploration budgets when needed.
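
As a concrete example of the calibration direction, standard temperature scaling (Guo et al., 2017) fits a single scalar on held‑out data so that scaled softmax probabilities better match empirical accuracy. This is a generic sketch of that known technique, not something the paper implements.

```python
import torch

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """Fit a scalar temperature T on held-out (logits, labels) so that
    softmax(logits / T) is better calibrated (Guo et al., 2017)."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = torch.nn.functional.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())  # divide future logits by T before softmax
```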

Overall, CoRefine offers a pragmatic, modular tool for developers who want the reasoning strength of massive sampling without the associated compute bill. By treating confidence as a control signal rather than a hard correctness guarantee, it opens a new pathway for efficient, adaptive LLM deployment.

Authors

  • Chen Jin
  • Ryutaro Tanno
  • Tom Diethe
  • Philip Teare

Paper Information

  • arXiv ID: 2602.08948v1
  • Categories: cs.AI, cs.CL
  • Published: February 9, 2026