[Paper] GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization

Published: January 8, 2026 at 01:59 PM EST
4 min read
Source: arXiv - 2601.05242v1

Overview

The paper introduces GDPO (Group reward‑Decoupled Normalization Policy Optimization), a new reinforcement‑learning (RL) algorithm designed for large language models (LLMs) that must satisfy multiple human‑aligned preferences simultaneously (e.g., correctness, format, safety). The authors show that the commonly used Group Relative Policy Optimization (GRPO) collapses distinct reward signals during normalization, hurting learning stability and final performance. GDPO fixes this by normalizing each reward independently, preserving their relative magnitudes and enabling more reliable multi‑reward training.

Key Contributions

  • Problem Diagnosis: Demonstrates analytically and empirically that GRPO’s shared normalization causes different reward streams to converge to the same advantage, degrading the training signal.
  • GDPO Algorithm: Proposes a simple yet effective modification—decoupled per‑reward normalization—while retaining the core benefits of group‑wise policy updates.
  • Comprehensive Evaluation: Benchmarks GDPO against GRPO on three diverse LLM tasks (tool‑calling, math reasoning, coding reasoning) using both correctness (accuracy, bug ratio) and constraint (format, length) metrics.
  • Stability Gains: Shows markedly smoother loss curves and fewer early‑training crashes, indicating higher robustness for large‑scale RL pipelines.
  • Open‑source Potential: The method is compatible with existing RL‑HF (Reinforcement Learning from Human Feedback) stacks, requiring only a change in the advantage‑normalization step.

Methodology

  1. Multi‑Reward Setup:

    • Each training example receives a vector of scalar rewards r = (r_1, r_2, …, r_K) (e.g., factual correctness, response length, JSON format).
    • The total advantage is traditionally computed by aggregating these rewards and then applying a single normalization across the batch (GRPO).
  2. Problem with Shared Normalization:

    • When rewards differ in scale or distribution, the shared mean‑variance normalization compresses their differences, making the resulting advantage values nearly identical across groups.
    • This “advantage collapse” reduces the gradient’s ability to distinguish which reward should be prioritized.
  3. GDPO’s Decoupled Normalization:

    • Compute a separate mean μ_k and standard deviation σ_k for each reward dimension k across the batch.
    • Normalize each advantage component independently: Â_k = (A_k - μ_k) / σ_k.
    • Combine the normalized components (e.g., as a weighted sum) to obtain the final advantage used in the policy‑gradient update (a minimal code sketch follows this list).
  4. Training Loop:

    • The rest of the RL pipeline (trajectory collection, KL‑penalty, PPO‑style clipping) stays unchanged, making GDPO a drop‑in replacement for GRPO in existing codebases.
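
To make the difference concrete, here is a minimal NumPy sketch (not the authors' implementation) contrasting the shared normalization of step 2 with the decoupled normalization of step 3. The function names, the toy reward scales, and the equal component weights are illustrative assumptions.

```python
import numpy as np

def shared_normalized_advantage(rewards, weights=None):
    """GRPO-style (as described above): aggregate the reward vector first,
    then apply a single mean/std normalization to the aggregated scalar."""
    rewards = np.asarray(rewards, dtype=float)            # shape: (batch, K)
    weights = np.ones(rewards.shape[1]) if weights is None else np.asarray(weights)
    total = rewards @ weights                             # aggregate per sample
    return (total - total.mean()) / (total.std() + 1e-8)

def decoupled_normalized_advantage(rewards, weights=None):
    """GDPO-style: normalize each reward dimension with its own mean/std,
    then combine the normalized components (here a weighted sum)."""
    rewards = np.asarray(rewards, dtype=float)            # shape: (batch, K)
    weights = np.ones(rewards.shape[1]) if weights is None else np.asarray(weights)
    mu = rewards.mean(axis=0)                             # mu_k per reward dimension
    sigma = rewards.std(axis=0) + 1e-8                    # sigma_k per reward dimension
    normalized = (rewards - mu) / sigma                   # A_hat_k = (A_k - mu_k) / sigma_k
    return normalized @ weights                           # weighted combination

# Toy batch: column 0 is a binary correctness reward, column 1 a format score
# on a much larger scale; samples differ in only one reward at a time.
batch = [[1.0, 95.0], [0.0, 95.0], [1.0, 40.0], [0.0, 40.0]]
print("shared   :", shared_normalized_advantage(batch).round(3))
print("decoupled:", decoupled_normalized_advantage(batch).round(3))
```

In this toy batch, flipping correctness shifts the shared-normalized advantage by only about 0.04 while flipping the format score shifts it by about 2.0, so the smaller-scale signal is nearly invisible; after decoupling, each reward contributes on the same unit scale (±1).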

Results & Findings

| Task | Metric | GRPO | GDPO |
|---|---|---|---|
| Tool Calling | Correctness (Acc.) | 71.2 % | 78.9 % |
| Tool Calling | Format adherence | 64.5 % | 73.1 % |
| Math Reasoning | Accuracy | 58.3 % | 66.7 % |
| Math Reasoning | Length constraint | 61.0 % | 69.4 % |
| Coding Reasoning | Bug‑free ratio | 45.8 % | 53.2 % |
| Coding Reasoning | JSON format | 52.1 % | 60.5 % |

  • Training Stability: GDPO’s loss curves exhibit far fewer spikes and rarely diverge, whereas GRPO shows occasional early‑training crashes (especially on the coding task).
  • Generalizability: The performance boost holds across tasks with very different reward structures, suggesting the method is not task‑specific.
  • Ablation: Removing per‑reward normalization (i.e., reverting to shared normalization) reproduces the GRPO degradation, confirming the core hypothesis.

Practical Implications

  • Better Multi‑Objective RL for LLMs: Developers building chatbots, code assistants, or agents that must obey format constraints (e.g., JSON APIs) can achieve higher fidelity without redesigning their reward engineering.
  • Plug‑and‑Play Upgrade: Since GDPO only changes the advantage‑normalization step, it can be integrated into popular RL‑HF libraries (e.g., trl, trlx) with a few lines of code (see the sketch after this list).
  • Reduced Training Costs: More stable gradients mean fewer restarts and less wasted GPU time, which is especially valuable for large‑scale models (70B+).
  • Improved Safety & Alignment: By preserving the distinct signals from safety‑related rewards (toxicity, bias) alongside utility rewards, GDPO helps maintain alignment guarantees while still optimizing performance.
  • Potential for Automated Reward Weighting: Because each reward retains its scale, downstream methods that learn optimal weighting (e.g., meta‑learning) can operate more reliably.
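
As a concrete illustration of the plug‑and‑play point, below is a hedged PyTorch sketch of where the change would sit in a generic PPO‑style update. The function names, the mixing weights, and the simplified loss (sequence‑level log‑probabilities, KL term omitted) are assumptions for illustration; this is not the API of trl, trlx, or the authors' code.

```python
import torch

def gdpo_advantages(rewards, weights, eps=1e-8):
    """The only GDPO-specific change: per-reward (decoupled) normalization.
    rewards: (batch, K) reward vectors; weights: (K,) mixing weights."""
    mu = rewards.mean(dim=0, keepdim=True)                     # mu_k per reward
    sigma = rewards.std(dim=0, keepdim=True, unbiased=False)   # sigma_k per reward
    return ((rewards - mu) / (sigma + eps)) @ weights          # (batch,) advantages

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Standard PPO-style clipped surrogate; untouched by GDPO
    (the KL penalty term is omitted here for brevity)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()

# Hypothetical training step: everything except the advantage call is the
# usual machinery of an existing PPO/GRPO pipeline.
batch_size, n_rewards = 16, 3
rewards = torch.rand(batch_size, n_rewards)        # e.g. correctness, format, length
weights = torch.tensor([1.0, 0.5, 0.5])            # illustrative mixing weights
logp_old = torch.randn(batch_size)
logp_new = logp_old + 0.05 * torch.randn(batch_size)
loss = ppo_clipped_loss(logp_new, logp_old, gdpo_advantages(rewards, weights))
print(loss.item())
```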

Limitations & Future Work

  • Scalability of Reward Count: The paper evaluates up to three reward dimensions; extremely high‑dimensional reward vectors may introduce new normalization challenges (e.g., covariance between rewards).
  • Weight Selection: GDPO still requires manual weighting of the normalized advantages; learning these weights automatically remains an open problem.
  • Theoretical Guarantees: While empirical results are strong, a formal convergence analysis for the decoupled normalization in the multi‑reward setting is not provided.
  • Broader Benchmarks: Future work could test GDPO on open‑ended generation tasks (e.g., story‑writing) where reward definitions are more subjective.

Bottom line: GDPO offers a low‑overhead, high‑impact improvement for anyone training LLMs with multiple, possibly conflicting, reward signals—making multi‑objective RL both more stable and more effective.

Authors

  • Shih‑Yang Liu
  • Xin Dong
  • Ximing Lu
  • Shizhe Diao
  • Peter Belcak
  • Mingjie Liu
  • Min‑Hung Chen
  • Hongxu Yin
  • Yu‑Chiang Frank Wang
  • Kwang‑Ting Cheng
  • Yejin Choi
  • Jan Kautz
  • Pavlo Molchanov

Paper Information

  • arXiv ID: 2601.05242v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: January 8, 2026