[Paper] PAEC: Position-Aware Entropy Calibration for LLM Reasoning in RLVR

Published: June 7, 2026 at 05:51 AM EDT

Source: arXiv - 2606.08543v1

Overview

Reinforcement learning with verifiable rewards (RLVR) improves large language model reasoning but often suffers from rapid policy-entropy collapse, where the policy prematurely concentrates on narrow high-probability reasoning paths. While global entropy regularization can encourage exploration, uniformly increasing entropy across all token positions is inefficient for long reasoning trajectories, where many tokens are not decision-relevant. We propose Position-Aware Entropy Calibration (PAEC), a token-level entropy-management framework that constructs a soft mask from local top-p entropy and top-two candidate competition, and applies an anchor-based lower-bound penalty to prevent selected-position entropy collapse. Experiments on five mathematical reasoning benchmarks show that PAEC improves macro-average majority-vote performance over strong RLVR baselines, with clear gains on AIME-style tasks. Our results suggest that entropy management in reasoning RL should be formulated as selective exploration allocation over decision-sensitive positions rather than uniform randomness injection.
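The abstract names three ingredients: a local top-p entropy, a top-two candidate competition signal, and an anchor-based lower-bound penalty applied at the selected positions. The following is a minimal sketch of how these pieces could fit together, not the authors' implementation. The exact formulas are not given here, so every function name, the way the two signals are combined into a mask (a simple product), the squashing of entropy with 1 - exp(-H), and the hinge form of the penalty are illustrative assumptions.

```python
import math

def top_p_entropy(probs, p=0.9):
    """Shannon entropy (in nats) of the renormalized top-p nucleus of one
    next-token distribution. `probs` is a list of probabilities summing to 1."""
    nucleus, mass = [], 0.0
    for q in sorted(probs, reverse=True):
        nucleus.append(q)
        mass += q
        if mass >= p:  # smallest prefix whose cumulative mass reaches p
            break
    return -sum((q / mass) * math.log(q / mass) for q in nucleus if q > 0)

def top2_competition(probs):
    """Ratio of second-best to best probability, in [0, 1].
    Near 1 means the two leading candidates are closely contested."""
    ranked = sorted(probs, reverse=True)
    if len(ranked) < 2 or ranked[0] <= 0:
        return 0.0
    return ranked[1] / ranked[0]

def soft_mask(probs, p=0.9):
    """ASSUMPTION: combine the two signals as a product. The entropy is
    squashed into [0, 1) with 1 - exp(-H) so the mask stays bounded."""
    return (1.0 - math.exp(-top_p_entropy(probs, p))) * top2_competition(probs)

def lower_bound_penalty(step_probs, anchor_entropies, p=0.9):
    """ASSUMPTION: mask-weighted hinge penalty. A position is penalized only
    when its top-p entropy falls below its anchor value; positions with a
    near-zero mask contribute almost nothing."""
    num = den = 0.0
    for probs, anchor in zip(step_probs, anchor_entropies):
        m = soft_mask(probs, p)
        h = top_p_entropy(probs, p)
        num += m * max(0.0, anchor - h)
        den += m
    return num / den if den > 0 else 0.0

# Two token positions: one contested between two candidates, one already decided.
contested = [0.48, 0.47, 0.03, 0.02]
peaked = [0.97, 0.01, 0.01, 0.01]
penalty = lower_bound_penalty([contested, peaked], anchor_entropies=[1.0, 1.0])
```

In this sketch the peaked position receives a mask of zero, so only the contested position contributes to the penalty. That is one way to read "selective exploration allocation over decision-sensitive positions": entropy is held above a floor only where the model is actually choosing between alternatives. In a training loop the penalty would presumably be added to the RLVR loss with some weight, and the anchor values would come from a reference such as an earlier policy, but both of those details are assumptions here.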

Authors

Shumeng Yang
Yisu Liu
Jiayi Zheng
Zhaohui Yang
Linjing Li

Paper Information

arXiv ID: 2606.08543v1
Categories: cs.AI
Published: June 7, 2026
