[Paper] Gradient-Guided Reward Optimization for Inference-time Alignment

Published: June 8, 2026 at 11:33 AM EDT

Source: arXiv - 2606.09635v1

Overview

Ensuring the reliability of Large Language Models (LLMs) under distribution drift requires inference-time adaptation. While inference-time alignment methods such as Best-of-$N$ and rejection sampling are widely used, they frame the task as a sampling-intensive, reward-guided search, leading to two key limitations: their performance is bounded by the base model’s generation quality, and their reliance on imperfect reward models makes them vulnerable to reward hacking.

To address these challenges, we introduce Gradient-Guided Reward Optimization (GGRO), a lightweight inference-time method that performs targeted, minimal intervention during decoding via gradient guidance. Specifically, GGRO monitors token-level entropy to identify high-uncertainty regions indicative of drift or misalignment. Upon detection, it responds by injecting nudging tokens, generated using gradient signals from an off-the-shelf reward model, to steer the generation trajectory rather than merely re-ranking samples.

Experiments show that GGRO consistently improves inference-time alignment across safety, helpfulness, and reasoning benchmarks. It also increases coverage of high-quality responses and robustness to reward hacking, with minimal computational overhead. Code is available at https://github.com/lhk2004/GGRO.
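The entropy-monitoring step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy logits, and the fixed threshold of 1.0 nats are all assumptions made for illustration, and the abstract does not say how GGRO actually chooses its trigger. The sketch only flags high-uncertainty decoding steps; the nudging-token generation from reward-model gradients is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def flag_high_entropy_steps(step_logits, threshold):
    """Return indices of decoding steps whose entropy exceeds `threshold`.

    `threshold` is a hypothetical hyperparameter, not a value from the paper.
    """
    return [
        i for i, logits in enumerate(step_logits)
        if token_entropy(logits) > threshold
    ]


# Toy logits over a 4-token vocabulary for three decoding steps.
step_logits = [
    [10.0, 0.0, 0.0, 0.0],  # one dominant token: low entropy
    [1.0, 1.0, 1.0, 1.0],   # uniform distribution: entropy = ln(4)
    [5.0, 5.0, 0.0, 0.0],   # roughly a two-way split: moderate entropy
]

flagged = flag_high_entropy_steps(step_logits, threshold=1.0)
print(flagged)
```

In a real decoder the logits would come from the language model at each step, and a flagged step would be the point at which an intervention such as the nudging described in the abstract could be applied.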

Key Contributions

This paper presents research in the following areas:

cs.CL (Computation and Language)

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CL.

Authors

Hankun Lin
Ruqi Zhang

Paper Information

arXiv ID: 2606.09635v1
Categories: cs.CL
Published: June 8, 2026
