[Paper] Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

Published: (June 7, 2026 at 03:59 AM EDT)
2 min read
Source: arXiv

Source: arXiv - 2606.08501v1

Overview

Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are indiscriminately assigned to all intermediate steps of the generation process, failing to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, we introduce Process Aligned Policy Optimization (PAPO), a novel framework that holistically aligns the RL update with the dLLM’s generative trajectory via Step-Aware Process Rewards (SPR) that transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) that replays authentic trajectories at high-uncertainty steps. Extensive experiments on four benchmarks demonstrate that PAPO significantly outperforms baselines, achieving gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown and 16.1% on Sudoku.

Key Contributions

This paper presents research in the following areas:

  • cs.CL

Methodology

Please refer to the full paper for detailed methodology.

Practical Implications

This research contributes to the advancement of cs.CL.

Authors

  • Yawen Shao
  • Jie Xiao
  • Kai Zhu
  • Yu Liu
  • Hongchen Luo
  • Xueyang Fu
  • Yang Cao
  • Wei Zhai
  • Zheng-Jun Zha

Paper Information

  • arXiv ID: 2606.08501v1
  • Categories: cs.CL
  • Published: June 7, 2026
  • PDF: Download PDF
0 views
Back to Blog

Related posts

Read more »