[Paper] CoRefine: Confidence-Guided Self-Refinement for Adaptive Test-Time Compute
Source: arXiv - 2602.08948v1
Overview
Large Language Models (LLMs) achieve higher reasoning accuracy by generating many parallel answer candidates, often hundreds of sampled reasoning traces per problem, but this “test‑time scaling” burns massive compute. CoRefine proposes a lightweight, confidence‑driven controller that sits on top of a frozen LLM and decides, on the fly, whether to accept an answer, request a re‑examination, or try a different reasoning path. The result: comparable or better performance while using only a few percent of the tokens that traditional 512‑sample decoding requires.
Key Contributions
- Confidence‑Guided Controller – a 211 k‑parameter Conv1D module that consumes the LLM’s full‑trace confidence scores and issues one of three actions: halt, refine, or switch strategy.
- Massive Token Savings – on average only 2.7 refinement steps per problem, translating to roughly a 190× reduction in token usage versus a 512‑sample baseline.
- High‑Precision Halting – when the controller decides to halt, the returned answer is correct 92.6 % of the time, indicating that confidence dynamics are a reliable proxy for answer quality.
- CoRefine‑Tree Extension – a hybrid sequential‑parallel variant that dynamically balances exploration (sampling more candidates) and exploitation (deepening a promising trace), making it easy to plug into existing serving stacks and verifier pipelines.
- Model‑Agnostic Design – demonstrated across three open‑source LLMs (LLaMA‑2‑7B, Falcon‑7B, and Mistral‑7B), showing that the approach works without fine‑tuning the underlying language model.
Methodology
- Freeze the LLM – The base model is left untouched; all inference happens exactly as in a standard deployment.
- Generate a Full‑Trace Confidence Vector – As the LLM produces each token, its softmax probability is recorded as a confidence score, yielding a time series of confidence values for the whole answer.
- Pass Through Conv1D Controller – The confidence vector is fed into a shallow 1‑D convolutional network (211 k parameters). The network learns patterns such as “confidence spikes then drops” that historically precede errors (a minimal sketch of the controller and its control loop follows this list).
- Action Decision
- Halt – If confidence dynamics indicate a high‑certainty correct answer, stop and return the current trace.
- Refine – If confidence degrades, request the LLM to continue generating a corrective continuation (self‑refinement).
- Switch – If the trace appears stuck, trigger a new sampling strategy (e.g., different temperature or prompt tweak).
- Iterate – The controller repeats the above until a halt decision is made or a maximum refinement budget is reached.
- CoRefine‑Tree – For more demanding tasks, a tree‑like search is built where each node is a refinement step; the controller decides whether to expand a node (explore) or descend deeper (exploit), effectively blending parallel sampling with sequential refinement (sketched in the second code block below).
All of this runs as a thin wrapper around the LLM, requiring only a single forward pass through the controller per refinement step, which is negligible compared to the LLM’s inference cost.
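To make the loop concrete, here is a minimal PyTorch sketch of the three pieces: confidence extraction, a shallow Conv1D controller, and the halt/refine/switch loop. It is an illustration under assumptions, not the paper’s implementation: the layer sizes are arbitrary (the paper specifies only a 211 k‑parameter Conv1D module), the self‑refinement prompt is hypothetical, and confidence is taken as the top‑1 softmax probability exposed by Hugging Face’s `generate`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

HALT, REFINE, SWITCH = 0, 1, 2  # the controller's three actions

class ConfidenceController(nn.Module):
    """Shallow Conv1D network over the per-token confidence time series.
    Layer sizes are illustrative, not the paper's exact 211k-param config."""
    def __init__(self, channels=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),        # pool over the variable-length trace
        )
        self.head = nn.Linear(channels, 3)  # logits for halt / refine / switch

    def forward(self, confidences):         # confidences: (batch, trace_len)
        h = self.net(confidences.unsqueeze(1)).squeeze(-1)
        return self.head(h)                 # (batch, 3)

def generate_with_confidence(model, tokenizer, prompt, max_new_tokens=256,
                             **gen_kwargs):
    """Generate a trace and record each token's top-1 softmax probability."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                         output_scores=True, return_dict_in_generate=True,
                         **gen_kwargs)
    # out.scores is a tuple of per-step logits over the vocabulary.
    conf = torch.stack([F.softmax(s, dim=-1).max(dim=-1).values
                        for s in out.scores], dim=1).float().cpu()
    text = tokenizer.decode(out.sequences[0], skip_special_tokens=True)
    return text, conf

def corefine(model, tokenizer, controller, prompt, max_steps=6):
    """Refine or switch until the controller halts or the budget is spent."""
    trace, conf = generate_with_confidence(model, tokenizer, prompt)
    for _ in range(max_steps):
        with torch.no_grad():
            action = controller(conf).argmax(dim=-1).item()
        if action == HALT:
            break
        if action == REFINE:
            # Hypothetical self-refinement prompt; the paper's exact
            # template is not given in this summary.
            prompt = trace + "\nWait, let me re-check the reasoning above."
            trace, conf = generate_with_confidence(model, tokenizer, prompt)
        else:  # SWITCH: resample with a different decoding strategy
            trace, conf = generate_with_confidence(
                model, tokenizer, prompt, do_sample=True, temperature=1.0)
    return trace
```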
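The tree variant can be sketched more abstractly. The node structure and selection rule below are assumptions: the paper describes only the high‑level explore/exploit decision, so `decide`, `expand_fn`, and `refine_fn` stand in for the controller, candidate sampling, and one refinement step.

```python
from dataclasses import dataclass, field

@dataclass
class TraceNode:
    trace: str                    # the reasoning trace at this node
    confidence: list              # per-token confidence values
    children: list = field(default_factory=list)

def mean_confidence(node):
    return sum(node.confidence) / max(len(node.confidence), 1)

def corefine_tree_step(node, decide, expand_fn, refine_fn, width=3):
    """One search step: widen (explore) or deepen (exploit) beneath `node`."""
    if decide(node.confidence) == "explore":
        # Explore: sample `width` fresh candidate traces as new children.
        node.children.extend(expand_fn(node.trace, n=width))
    else:
        # Exploit: push the most promising trace one refinement step deeper.
        best = max(node.children or [node], key=mean_confidence)
        best.children.append(refine_fn(best.trace))
    return node
```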
Results & Findings
| Benchmark | Baseline (512 samples) | CoRefine (avg. 2.7 steps) | Token Reduction | Accuracy Δ |
|---|---|---|---|---|
| GSM‑8K (arithmetic) | 78.4 % | 77.9 % | ~190× | –0.5 % |
| MATH (college‑level) | 45.2 % | 45.0 % | ~190× | –0.2 % |
| ARC‑Easy | 84.1 % | 84.3 % | ~190× | +0.2 % |
| TruthfulQA | 62.5 % | 62.2 % | ~190× | –0.3 % |
- Precision of confident halts: 92.6 % (i.e., when the controller says “stop”, the returned answer is correct 92.6 % of the time).
- Average refinement steps: 2.7 per problem, with a max of 6 in the worst‑case scenarios.
- Latency impact: End‑to‑end response time dropped by ~30 % on a typical GPU server because fewer tokens need to be processed.
These numbers show that CoRefine can match or slightly improve accuracy while slashing compute and latency dramatically.
Practical Implications
- Cost‑Effective Scaling – Deployers can keep the same LLM checkpoint but cut inference bills by an order of magnitude, especially for high‑throughput services (e.g., code‑assistant APIs, tutoring bots).
- Dynamic Resource Allocation – The controller’s “refine vs. halt” signal can be hooked into autoscaling policies: spin up extra GPU instances only when many refinements are triggered.
- Agentic Systems & Verifiers – In multi‑step AI agents where a verifier decides whether to accept a sub‑task result, CoRefine can act as a pre‑verifier, reducing the verifier’s workload and improving overall pipeline reliability.
- Plug‑and‑Play Integration – Because the LLM stays frozen, existing production stacks (Hugging Face pipelines, LangChain, OpenAI‑compatible endpoints) can adopt CoRefine by adding a thin inference wrapper; no retraining or model conversion is needed (see the integration sketch after this list).
- Environmental Impact – Fewer token generations translate directly into lower energy consumption, aligning large‑scale LLM deployments with sustainability goals.
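As a rough illustration of the plug‑and‑play claim, wiring CoRefine into an off‑the‑shelf checkpoint could look like the following, reusing the hypothetical `corefine` helper and `ConfidenceController` from the methodology sketch above; the model name and prompt are placeholders.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"      # any frozen causal LM should work
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
controller = ConfidenceController()         # trained separately on confidence traces

answer = corefine(model, tokenizer, controller,
                  "Q: A train travels 60 km in 45 minutes. "
                  "What is its average speed in km/h? A:")
print(answer)
```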
Limitations & Future Work
- Reliance on Confidence Signals – The approach assumes that the LLM’s internal softmax confidence correlates with correctness; for models with poorly calibrated probabilities, halting precision may drop.
- Controller Generalization – While tested on three open‑source models, the Conv1D controller may need re‑training or fine‑tuning for newer architectures (e.g., transformer‑based vision‑language models).
- Edge Cases – Extremely ambiguous or multi‑modal questions sometimes trigger the maximum refinement budget without reaching a confident halt, leading to timeouts.
- Future Directions
- Incorporate calibration techniques (temperature scaling, isotonic regression) to improve confidence reliability (a temperature‑scaling sketch follows this list).
- Explore meta‑learning where the controller adapts online to a specific user’s query distribution.
- Extend CoRefine‑Tree to distributed inference across multiple nodes, enabling even larger parallel‑exploration budgets when needed.
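As a concrete handle on the first direction, here is a minimal sketch of standard post‑hoc temperature scaling: fit a single scalar T on held‑out (logits, label) pairs so that softmax(z / T) better matches empirical correctness. This is a textbook calibration recipe, not code from the paper.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels):
    """Fit a scalar temperature T on held-out data by minimizing NLL."""
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so that T > 0
    opt = torch.optim.LBFGS([log_t], lr=0.05, max_iter=100)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()                    # fitted temperature T
```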
Overall, CoRefine offers a pragmatic, modular tool for developers who want the reasoning strength of massive sampling without the associated compute bill. By treating confidence as a control signal rather than a hard correctness guarantee, it opens a new pathway for efficient, adaptive LLM deployment.
Authors
- Chen Jin
- Ryutaro Tanno
- Tom Diethe
- Philip Teare
Paper Information
- arXiv ID: 2602.08948v1
- Categories: cs.AI, cs.CL
- Published: February 9, 2026