[Paper] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Published: January 8, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.05242v1

Overview

The paper introduces GDPO (Group reward‑Decoupled Normalization Policy Optimization), a new reinforcement‑learning (RL) algorithm designed for large language models (LLMs) that must satisfy multiple human‑aligned preferences simultaneously (e.g., correctness, format, safety). The authors show that the commonly used Group Relative Policy Optimization (GRPO) collapses distinct reward signals during normalization, hurting learning stability and final performance. GDPO fixes this by normalizing each reward independently, preserving their relative magnitudes and enabling more reliable multi‑reward training.

Key Contributions

  • Problem Diagnosis: Demonstrates analytically and empirically that GRPO’s shared normalization causes different reward streams to converge to the same advantage, degrading the training signal.
  • GDPO Algorithm: Proposes a simple yet effective modification—decoupled per‑reward normalization—while retaining the core benefits of group‑wise policy updates.
  • Comprehensive Evaluation: Benchmarks GDPO against GRPO on three diverse LLM tasks (tool‑calling, math reasoning, coding reasoning) using both correctness (accuracy, bug ratio) and constraint (format, length) metrics.
  • Stability Gains: Shows markedly smoother loss curves and fewer early‑training crashes, indicating higher robustness for large‑scale RL pipelines.
  • Open‑source Potential: The method is compatible with existing RL‑HF (Reinforcement Learning from Human Feedback) stacks, requiring only a change in the advantage‑normalization step.

Methodology

  1. Multi‑Reward Setup:

    • Each training example receives a vector of scalar rewards r = (r_1, r_2, …, r_K) (e.g., factual correctness, response length, JSON format).
    • The total advantage is traditionally computed by aggregating these rewards and then applying a single normalization across the batch (GRPO).
  2. Problem with Shared Normalization:

    • When rewards differ in scale or distribution, the shared mean‑variance normalization compresses their differences, making the resulting advantage values nearly identical across groups.
    • This “advantage collapse” reduces the gradient’s ability to distinguish which reward should be prioritized.
  3. GDPO’s Decoupled Normalization:

    • Compute a separate mean μ_k and standard deviation σ_k for each reward dimension k across the batch.
    • Normalize each advantage component independently: Â_k = (A_k - μ_k) / σ_k.
    • Combine the normalized components (e.g., as a weighted sum) to obtain the final advantage used in the policy‑gradient update (a minimal code sketch follows this list).
  4. Training Loop:

    • The rest of the RL pipeline (trajectory collection, KL‑penalty, PPO‑style clipping) stays unchanged, making GDPO a drop‑in replacement for GRPO in existing codebases.
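
To make the difference concrete, here is a minimal NumPy sketch (not the authors' implementation) contrasting the shared normalization of step 2 with the decoupled normalization of step 3. The function names, the toy reward scales, and the equal component weights are illustrative assumptions.

```python
import numpy as np

def shared_normalized_advantage(rewards, weights=None):
    """GRPO-style (as described above): aggregate the reward vector first,
    then apply a single mean/std normalization to the aggregated scalar."""
    rewards = np.asarray(rewards, dtype=float)            # shape: (batch, K)
    weights = np.ones(rewards.shape[1]) if weights is None else np.asarray(weights)
    total = rewards @ weights                             # aggregate per sample
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_normalized_advantage(rewards, weights=None):
    """GDPO-style: normalize each reward dimension with its own mean/std,
    then combine the normalized components (here a weighted sum)."""
    rewards = np.asarray(rewards, dtype=float)            # shape: (batch, K)
    weights = np.ones(rewards.shape[1]) if weights is None else np.asarray(weights)
    mu = rewards.mean(axis=0)                             # mu_k per reward dimension
    sigma = rewards.std(axis=0) + 1e-8                    # sigma_k per reward dimension
    normalized = (rewards - mu) / sigma                   # A_hat_k = (A_k - mu_k) / sigma_k
    return normalized @ weights                           # weighted combination

# Toy batch: column 0 is a binary correctness reward, column 1 a format score
# on a much larger scale; samples differ in only one reward at a time.
batch = [[1.0, 95.0], [0.0, 95.0], [1.0, 40.0], [0.0, 40.0]]
print("shared   :", shared_normalized_advantage(batch).round(3))
print("decoupled:", decoupled_normalized_advantage(batch).round(3))
```

In this toy batch, flipping correctness shifts the shared-normalized advantage by only about 0.04 while flipping the format score shifts it by about 2.0, so the smaller-scale signal is nearly invisible; after decoupling, each reward contributes on the same unit scale (±1).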

Results & Findings

| Task | Metric | GRPO | GDPO |
|---|---|---|---|
| Tool Calling | Correctness (Acc.) | 71.2 % | 78.9 % |
| Tool Calling | Format adherence | 64.5 % | 73.1 % |
| Math Reasoning | Accuracy | 58.3 % | 66.7 % |
| Math Reasoning | Length constraint | 61.0 % | 69.4 % |
| Coding Reasoning | Bug‑free ratio | 45.8 % | 53.2 % |
| Coding Reasoning | JSON format | 52.1 % | 60.5 % |

  • Training Stability: GDPO’s loss curves exhibit far fewer spikes and rarely diverge, whereas GRPO shows occasional early‑training crashes (especially on the coding task).
  • Generalizability: The performance boost holds across tasks with very different reward structures, suggesting the method is not task‑specific.
  • Ablation: Removing per‑reward normalization (i.e., reverting to shared normalization) reproduces the GRPO degradation, confirming the core hypothesis.

Practical Implications

  • Better Multi‑Objective RL for LLMs: Developers building chatbots, code assistants, or agents that must obey format constraints (e.g., JSON APIs) can achieve higher fidelity without redesigning their reward engineering.
  • Plug‑and‑Play Upgrade: Since GDPO only changes the advantage‑normalization step, it can be integrated into popular RL‑HF libraries (e.g., trl, trlx) with a few lines of code (see the sketch after this list).
  • Reduced Training Costs: More stable gradients mean fewer restarts and less wasted GPU time, which is especially valuable for large‑scale models (70B+).
  • Improved Safety & Alignment: By preserving the distinct signals from safety‑related rewards (toxicity, bias) alongside utility rewards, GDPO helps maintain alignment guarantees while still optimizing performance.
  • Potential for Automated Reward Weighting: Because each reward retains its scale, downstream methods that learn optimal weighting (e.g., meta‑learning) can operate more reliably.
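
As a concrete illustration of the plug‑and‑play point, below is a hedged PyTorch sketch of where the change would sit in a generic PPO‑style update. The function names, the mixing weights, and the simplified loss (sequence‑level log‑probabilities, KL term omitted) are assumptions for illustration; this is not the API of trl, trlx, or the authors' code.

```python
import torch

def gdpo_advantages(rewards, weights, eps=1e-8):
    """The only GDPO-specific change: per-reward (decoupled) normalization.
    rewards: (batch, K) reward vectors; weights: (K,) mixing weights."""
    mu = rewards.mean(dim=0, keepdim=True)                     # mu_k per reward
    sigma = rewards.std(dim=0, keepdim=True, unbiased=False)   # sigma_k per reward
    return ((rewards - mu) / (sigma + eps)) @ weights          # (batch,) advantages

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO-style clipped surrogate; untouched by GDPO
    (the KL penalty term is omitted here for brevity)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Hypothetical training step: everything except the advantage call is the
# usual machinery of an existing PPO/GRPO pipeline.
batch_size, n_rewards = 16, 3
rewards = torch.rand(batch_size, n_rewards)        # e.g. correctness, format, length
weights = torch.tensor([1.0, 0.5, 0.5])            # illustrative mixing weights
logp_old = torch.randn(batch_size)
logp_new = logp_old + 0.05 * torch.randn(batch_size)
loss = ppo_clipped_loss(logp_new, logp_old, gdpo_advantages(rewards, weights))
print(loss.item())
```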

Limitations & Future Work

  • Scalability of Reward Count: The paper evaluates up to three reward dimensions; extremely high‑dimensional reward vectors may introduce new normalization challenges (e.g., covariance between rewards).
  • Weight Selection: GDPO still requires manual weighting of the normalized advantages; learning these weights automatically remains an open problem.
  • Theoretical Guarantees: While empirical results are strong, a formal convergence analysis for the decoupled normalization in the multi‑reward setting is not provided.
  • Broader Benchmarks: Future work could test GDPO on open‑ended generation tasks (e.g., story‑writing) where reward definitions are more subjective.

Bottom line: GDPO offers a low‑overhead, high‑impact improvement for anyone training LLMs with multiple, possibly conflicting, reward signals—making multi‑objective RL both more stable and more effective.

Authors

  • Shih‑Yang Liu
  • Xin Dong
  • Ximing Lu
  • Shizhe Diao
  • Peter Belcak
  • Mingjie Liu
  • Min‑Hung Chen
  • Hongxu Yin
  • Yu‑Chiang Frank Wang
  • Kwang‑Ting Cheng
  • Yejin Choi
  • Jan Kautz
  • Pavlo Molchanov

Paper Information

  • arXiv ID: 2601.05242v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: January 8, 2026