[Paper] On the Rejection Criterion for Proxy-based Test-time Alignment
Source: arXiv - 2604.16146v1
Overview
A recent paper by Hammal, Zweigenbaum, and Corro investigates how “proxy‑based” test‑time alignment works for large language models (LLMs). The authors show that two popular strategies—implicit reward and nudging—are mathematically equivalent except for the way they decide when to reject a token from the big model. They argue that the usual confidence‑based rejection rule is poorly grounded and introduce a conservative confidence‑bet criterion that yields consistently better alignment on several benchmarks.
Key Contributions
- Unified graphical‑model view: Demonstrates that implicit‑reward and nudging methods can be expressed as sampling from the same underlying graphical model, differing only in the rejection distribution.
- Critical analysis of confidence‑based rejection: Shows that raw confidence scores are unreliable for ambiguous or polysemous inputs, leading to sub‑optimal alignment.
- Conservative confidence‑bet criterion: Proposes a new rejection rule that treats the small aligned model’s confidence as a bet and only accepts it when the bet is sufficiently “conservative.”
- Empirical validation: Outperforms prior proxy‑based alignment techniques on multiple datasets (e.g., XSum, CNN/DailyMail, and WMT translation tasks).
- Open‑source implementation: Provides code and reproducible scripts, facilitating immediate adoption by the community.
Methodology
Problem Setup
- A large base model (unaligned) generates tokens autoregressively.
- A small aligned proxy (trained on a modest amount of aligned data) serves as a guide.
Graphical‑Model Formalism
- Both implicit‑reward and nudging are cast as a joint distribution over the token sequence and a binary accept/reject variable.
- The only difference lies in the rejection distribution (p_{\text{rej}}(r_t|x_{<t})).
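One way to write such a joint distribution is sketched below, using the summary's notation with (r_t = 1) denoting rejection of the base token at step (t); the exact factorization in the paper may differ:

\[
p(y_{1:T}, r_{1:T} \mid x)
  = \prod_{t=1}^{T} p_{\text{rej}}(r_t \mid x_{<t})
    \,\bigl[p_{\text{proxy}}(y_t \mid x_{<t})\bigr]^{r_t}
    \,\bigl[p_{\text{base}}(y_t \mid x_{<t})\bigr]^{1 - r_t}
\]

Under this view, implicit reward and nudging share everything except the choice of (p_{\text{rej}}).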
Critique of Confidence‑Based Rejection
- Confidence is taken as the max softmax probability of the base model.
- The authors illustrate failure cases where high confidence coincides with ambiguous phrasing (e.g., “bank” vs. “river bank”).
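The max-softmax confidence the authors critique can be computed in a few lines (a minimal sketch, not the paper's code):

```python
import numpy as np

def max_softmax_confidence(logits: np.ndarray) -> float:
    """Confidence of a model at one decoding step: the maximum
    probability after a softmax over the vocabulary."""
    z = logits - logits.max()            # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return float(probs.max())

# Peaked logits yield high confidence; near-uniform logits
# (the ambiguous cases the paper highlights) yield low confidence.
peaked = max_softmax_confidence(np.array([8.0, 1.0, 0.5]))
flat = max_softmax_confidence(np.array([1.0, 1.0, 1.0]))
```

The failure mode is that a peaked distribution can still sit on the wrong sense of an ambiguous word, so a high score does not guarantee a well-aligned token.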
Conservative Confidence‑Bet (CCB) Criterion
- Compute the proxy’s confidence (c_t) for the token it would produce.
- Define a bet (b_t = \lambda \cdot c_t) where (\lambda \in (0,1]) is a safety factor.
- Reject the base model’s token if the base’s probability (p_{\text{base}}(y_t|x_{<t}) < b_t).
- This yields a more cautious fallback: the system defers to the proxy only when the base model is genuinely uncertain.
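The steps above reduce to a single predicate (a sketch; the variable names are mine, not the paper's):

```python
def ccb_reject(p_base_token: float, proxy_conf: float, lam: float = 0.8) -> bool:
    """Conservative confidence-bet rule: the proxy 'bets' lam * c_t,
    and the base token is rejected when its probability loses the bet."""
    assert 0.0 < lam <= 1.0, "safety factor must lie in (0, 1]"
    bet = lam * proxy_conf
    return p_base_token < bet

# A confident base model keeps its token even against a confident proxy:
ccb_reject(p_base_token=0.9, proxy_conf=0.95)   # bet is 0.76, so no rejection
# An uncertain base model defers to the proxy:
ccb_reject(p_base_token=0.3, proxy_conf=0.95)   # 0.3 loses the 0.76 bet
```

Smaller values of (\lambda) make the rule more permissive toward the base model; (\lambda = 1) accepts the proxy whenever it is more confident than the base.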
Training & Inference
- No extra fine‑tuning of the base model is required; the CCB rule is applied at inference time.
- The proxy is trained once on a small aligned corpus (e.g., 10k examples).
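Putting the pieces together, one greedy decoding step under CCB might look like the following runnable toy sketch (the logits are made up for illustration; a real pipeline would take them from the two models' forward passes):

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

def ccb_step(base_logits: np.ndarray, proxy_logits: np.ndarray,
             lam: float = 0.8) -> tuple[int, bool]:
    """One greedy decoding step under the conservative confidence-bet
    rule. Returns (token_id, used_proxy)."""
    p_base = softmax(base_logits)
    p_proxy = softmax(proxy_logits)
    base_tok = int(p_base.argmax())
    bet = lam * p_proxy.max()            # the proxy's scaled confidence bet
    if p_base[base_tok] < bet:           # base model loses the bet
        return int(p_proxy.argmax()), True
    return base_tok, False

# Peaked base logits: the base model's token survives.
tok, used_proxy = ccb_step(np.array([5.0, 0.0, 0.0]), np.array([0.0, 5.0, 0.0]))
# Flat base logits: the step falls back to the proxy's choice.
tok2, used_proxy2 = ccb_step(np.array([0.1, 0.0, 0.0]), np.array([0.0, 5.0, 0.0]))
```

Because the rule only compares two scalars per step, it adds no training cost on top of the proxy's single forward pass.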
Results & Findings
| Dataset (metric, ↑ better) | Base model | Implicit‑Reward | Nudging | CCB (proposed) |
|---|---|---|---|---|
| XSum (ROUGE‑L) | 23.1 | 22.4 | 22.7 | 24.0 |
| CNN/DailyMail (BLEU) | 27.5 | 26.8 | 27.0 | 28.3 |
| WMT‑En‑De (BLEU) | 31.2 | 30.5 | 30.8 | 32.1 |
- Statistical significance: Improvements are significant at (p < 0.01) (paired bootstrap).
- Ablation: Removing the safety factor (\lambda) drops performance back to the nudging baseline, confirming the importance of conservatism.
- Qualitative analysis: The CCB rule reduces hallucinations and preserves factual consistency in summarization tasks.
Practical Implications
- Plug‑and‑play alignment: Developers can add a lightweight proxy model (few hundred MB) to any existing LLM deployment without retraining the large model.
- Reduced hallucination risk: By deferring to the proxy only when the base model is genuinely unsure, the system produces more reliable outputs for downstream applications (chatbots, summarizers, translation services).
- Cost‑effective scaling: The proxy can run on cheaper hardware (CPU or low‑end GPU), while the base model stays on high‑performance accelerators, enabling a hybrid inference pipeline.
- Safety & compliance: The conservative rejection rule aligns well with regulatory demands for “guardrails” in AI systems, offering a transparent, tunable parameter ((\lambda)) to control risk.
Limitations & Future Work
- Proxy size vs. coverage: A very small proxy may lack vocabulary or domain knowledge, limiting its ability to correct the base model in specialized contexts.
- Latency overhead: The extra forward pass through the proxy adds ~10–15 % inference latency; optimizing batching or model distillation could mitigate this.
- Dynamic (\lambda) selection: The current work uses a fixed safety factor; future research could learn (\lambda) adaptively per token or per task.
- Broader evaluation: Experiments are limited to English summarization and translation; extending to multilingual, code generation, or multimodal tasks remains an open direction.
Bottom line: By reframing test‑time alignment as a simple, conservative betting game, the authors deliver a method that is both theoretically clean and practically superior. For developers looking to tighten the safety and factuality of large language model outputs without costly retraining, the Conservative Confidence‑Bet criterion offers an immediately usable tool.
Authors
- Ayoub Hammal
- Pierre Zweigenbaum
- Caio Corro
Paper Information
- arXiv ID: 2604.16146v1
- Categories: cs.CL
- Published: April 17, 2026