[Paper] Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

Published: June 7, 2026 at 03:59 AM EDT

Source: arXiv - 2606.08501v1

Overview

Reinforcement learning (RL) holds immense promise for enhancing the reasoning capabilities of diffusion large language models (dLLMs). However, progress is fundamentally constrained by a dual misalignment between the authentic generation trajectory and the gradient update process: (i) Process-reward misalignment. Sparse, terminal rewards are assigned indiscriminately to all intermediate steps of the generation process, so they fail to provide discriminative credit assignment. (ii) State-trajectory misalignment. Policy updates are often diverted toward artificial, out-of-trajectory states, squandering gradients on less informative samples. To address these limitations, the authors introduce Process Aligned Policy Optimization (PAPO), a framework that aligns the RL update with the dLLM’s generative trajectory through two components: Step-Aware Process Rewards (SPR), which transform sparse terminal rewards into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR), which replays authentic trajectories at high-uncertainty steps. Experiments on four benchmarks show that PAPO outperforms baselines, with gains of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown, and 16.1% on Sudoku.
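To make the two ideas in the overview more concrete, here is a minimal Python sketch. It is a toy illustration and not the paper's implementation: the summary above does not give the SPR or EHR formulas, so the function names (`broadcast_terminal_reward`, `entropy`, `high_uncertainty_steps`), the use of mean token entropy as the uncertainty score, and the top-k selection rule are all assumptions made for illustration. The first function shows the baseline behavior the overview criticizes (one terminal reward copied to every step), and the last shows one plausible way to pick high-uncertainty steps.

```python
import math
from typing import List, Sequence


def broadcast_terminal_reward(reward: float, num_steps: int) -> List[float]:
    """Baseline credit assignment: every generation step receives the
    same terminal reward, so no step is distinguished from another."""
    return [reward] * num_steps


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def high_uncertainty_steps(
    step_dists: Sequence[Sequence[Sequence[float]]], k: int
) -> List[int]:
    """Return the indices of the k steps with the highest mean token
    entropy, in ascending step order.

    step_dists[t][i] is the (hypothetical) predictive distribution for
    token position i at generation step t. Scoring steps by mean entropy
    and keeping the top k is an assumption, not the paper's rule.
    """
    mean_entropy = [
        sum(entropy(d) for d in dists) / len(dists) for dists in step_dists
    ]
    ranked = sorted(
        range(len(mean_entropy)), key=lambda t: mean_entropy[t], reverse=True
    )
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Three toy steps, two token positions each, vocabulary of size 3.
    dists = [
        [[1 / 3, 1 / 3, 1 / 3], [0.5, 0.5, 0.0]],   # uncertain step
        [[0.9, 0.1, 0.0], [1.0, 0.0, 0.0]],         # confident step
        [[0.5, 0.25, 0.25], [0.6, 0.4, 0.0]],       # fairly uncertain step
    ]
    print(broadcast_terminal_reward(1.0, len(dists)))
    print(high_uncertainty_steps(dists, k=2))
```

In this toy setup the broadcast baseline gives all three steps identical credit, while the entropy score singles out the first and third steps as candidates for replay. How PAPO actually weights steps and re-enacts trajectories is specified in the full paper.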

Key Contributions

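The paper, filed under cs.CL, makes the following contributions:

Identifies a dual misalignment in RL for dLLMs: process-reward misalignment and state-trajectory misalignment.
Proposes Process Aligned Policy Optimization (PAPO), a framework that aligns the RL update with the dLLM's generative trajectory.
Introduces Step-Aware Process Rewards (SPR), which turn sparse terminal rewards into dense, step-wise credit.
Introduces Entropy-Guided Historical Re-enactment (EHR), which replays authentic trajectories at high-uncertainty steps.
Reports gains over baselines of up to 4.5% on GSM8K, 4.8% on MATH500, 42.2% on Countdown, and 16.1% on Sudoku.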

Methodology

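PAPO combines two components. Step-Aware Process Rewards (SPR) convert the sparse terminal reward into dense, step-wise credit, and Entropy-Guided Historical Re-enactment (EHR) replays authentic trajectories at high-uncertainty steps so that updates stay on states the model actually visited. Please refer to the full paper for the detailed methodology.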

Practical Implications

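The reported results suggest that aligning RL updates with the dLLM's own generation trajectory can improve reasoning performance, with the largest reported gains on Countdown and Sudoku and smaller gains on GSM8K and MATH500.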

Authors

Yawen Shao
Jie Xiao
Kai Zhu
Yu Liu
Hongchen Luo
Xueyang Fu
Yang Cao
Wei Zhai
Zheng-Jun Zha

Paper Information

arXiv ID: 2606.08501v1
Categories: cs.CL
Published: June 7, 2026
