[Paper] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs

Published: December 9, 2025 at 11:39 AM EST
4 min read

Source: arXiv - 2512.08786v1

Overview

Large language models (LLMs) are increasingly fine‑tuned with reinforcement learning from human feedback (RLHF) to make them behave responsibly. When that feedback comes from many different user groups, as in a federated setting where each organization or community trains locally, the usual “average‑the‑rewards” approach can drown out minority viewpoints. This paper proposes a systematic way to evaluate how those disparate preference signals should be combined, and it introduces an adaptive aggregation scheme that balances alignment quality with fairness across groups.

Key Contributions

  • Evaluation framework for measuring the trade‑off between alignment performance and fairness of reward aggregation in federated RLHF.
  • Comprehensive benchmark on question‑answering tasks using a PPO‑based RLHF pipeline, covering three classic aggregators (min, max, average).
  • Novel adaptive aggregation algorithm that re‑weights each group’s reward signal based on its historical alignment success, without ever transmitting raw data (see the sketch after this list).
  • Empirical evidence that the adaptive method improves fairness (more equitable performance across groups) while keeping overall alignment scores on par with the best static baselines.
  • Open‑source reference implementation (code and scripts) to help practitioners reproduce and extend the experiments.
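
To make the adaptive idea concrete, here is a minimal, illustrative sketch of what such a re‑weighting aggregator could look like. The class name, the exponential‑moving‑average update, and the momentum value are assumptions for illustration, not the authors’ released implementation; the min/max/average baselines are included for contrast.

```python
import numpy as np

class AdaptiveAggregator:
    """Illustrative re-weighting aggregator (assumed design, not the paper's exact code)."""

    def __init__(self, num_groups: int, momentum: float = 0.9):
        self.weights = np.ones(num_groups) / num_groups   # start from a uniform mix
        self.momentum = momentum                           # assumed EMA smoothing factor

    def update(self, alignment_scores) -> None:
        """Grow the weight of groups whose rewards led to better downstream
        alignment in the previous round (moving average, then renormalize)."""
        scores = np.asarray(alignment_scores, dtype=float)
        target = scores / scores.sum()                     # relative alignment success
        self.weights = self.momentum * self.weights + (1 - self.momentum) * target
        self.weights /= self.weights.sum()

    def aggregate(self, group_rewards) -> float:
        """Fuse per-group scalar rewards into the single reward fed to PPO."""
        return float(np.dot(self.weights, np.asarray(group_rewards, dtype=float)))

# Static baselines evaluated in the paper, for contrast:
def min_agg(rewards): return float(np.min(rewards))    # worst-case
def max_agg(rewards): return float(np.max(rewards))    # best-case
def avg_agg(rewards): return float(np.mean(rewards))   # standard averaging

# Example round: update weights from last round's per-group alignment, then aggregate.
agg = AdaptiveAggregator(num_groups=3)
agg.update([0.74, 0.69, 0.81])
print(agg.weights, agg.aggregate([0.2, 0.5, 0.9]))
```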

Methodology

  1. Federated RLHF setup – Each participating group (e.g., a company, a regional user cohort) runs a local RLHF loop: it samples model rollouts, collects human preference judgments, and computes a scalar reward signal. No raw text or user data leaves the group.
  2. Reward aggregation strategies – The central server receives only the per‑group reward values and combines them using:
    • Min (worst‑case),
    • Max (best‑case),
    • Average (standard), and
    • Adaptive (proposed): a moving‑average weight for each group that grows when that group’s rewards lead to higher downstream alignment metrics.
  3. Training pipeline – The aggregated reward drives a PPO (Proximal Policy Optimization) update on the global LLM. The process repeats for several federated rounds.
  4. Metrics
    • Alignment score: standard RLHF evaluation (e.g., win‑rate against a reference model on Q/A).
    • Fairness index: variance or disparity of alignment scores across groups (lower variance = higher fairness). Both metrics are sketched after this list.
  5. Experimental protocol – Three heterogeneous user groups with distinct preference distributions were simulated, and each experiment was repeated over multiple random seeds to ensure statistical reliability.
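
The two metrics can be computed directly from per‑group evaluation results. The helper names and the 200‑prompt example below are hypothetical; the paper only specifies a win‑rate‑style alignment score and a variance/disparity‑style fairness index.

```python
import numpy as np

def alignment_score(wins: int, total: int) -> float:
    """Win rate (%) of the fine-tuned model against a reference model on Q/A prompts."""
    return 100.0 * wins / total

def fairness_index(per_group_scores) -> float:
    """Dispersion of alignment scores across groups (lower = more equitable)."""
    return float(np.std(per_group_scores))

# Hypothetical round with three heterogeneous groups and 200 evaluation prompts each:
scores = [alignment_score(wins, total=200) for wins in (152, 148, 161)]
print(scores)                    # per-group alignment scores
print(fairness_index(scores))    # standard deviation across groups
```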

Results & Findings

Aggregator | Avg. Alignment Score ↑ | Fairness (Std. Dev.) ↓
Min        | 71.2 %                 | 4.1 %
Max        | 78.9 %                 | 9.8 %
Average    | 77.4 %                 | 6.3 %
Adaptive   | 77.1 %                 | 3.2 %
  • The adaptive scheme is on par with plain averaging in raw alignment (77.1 % vs. 77.4 %) and within about two points of the best static aggregator (max), while halving the fairness disparity relative to average.
  • Across all runs, the adaptive method consistently kept the worst‑performing group within 2 % of the best‑performing group, a notable improvement over the min and max baselines.
  • Ablation studies show that the benefit stems from the dynamic weighting rather than simply smoothing; fixing the weights early on erodes the fairness gains.

Practical Implications

  • Product teams can deploy RLHF pipelines that respect regional or demographic differences without centralizing sensitive feedback data—critical for GDPR‑compliant AI services.
  • Marketplace AI platforms (e.g., code assistants, chatbots) can guarantee a baseline quality for all partner developers, reducing the risk of “model bias” complaints from minority user bases.
  • Open‑source model maintainers gain a ready‑to‑use recipe for federated fine‑tuning that automatically balances performance and equity, lowering the engineering overhead of custom weighting schemes.
  • The adaptive aggregator can be plugged into existing PPO‑based RLHF libraries (e.g., Hugging Face’s TRL) with minimal code changes—just replace the reward‑averaging step with the provided weighting logic (a minimal sketch follows this list).
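
As a hedged illustration of that swap, the sketch below scores each rollout with every group’s reward model and fuses the scores with an arbitrary aggregation rule. `combined_reward`, the stubbed reward models, and the `aggregate` callback are placeholders for whatever your RLHF library exposes, not a specific TRL API.

```python
def combined_reward(prompt, response, group_reward_models, aggregate):
    """Score one rollout with every group's reward model, then fuse the scores
    with the chosen aggregation rule instead of a plain mean."""
    per_group = [rm(prompt, response) for rm in group_reward_models]
    return aggregate(per_group)

# Toy usage with stubbed reward models; in practice `aggregate` could be the
# AdaptiveAggregator.aggregate method from the earlier sketch.
stub_rms = [lambda p, r, bias=b: bias for b in (0.2, 0.7, 0.5)]
print(combined_reward("prompt", "response", stub_rms, lambda xs: sum(xs) / len(xs)))
```

The PPO update itself is untouched; only the scalar reward it consumes changes.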

Limitations & Future Work

  • Synthetic groups: The experiments used simulated preference distributions; real‑world federated deployments may exhibit more complex, non‑stationary behavior.
  • Scalability: Weight updates are computed centrally; scaling to thousands of clients could introduce latency—future work could explore decentralized or hierarchical weighting.
  • Reward granularity: Only scalar rewards were aggregated; richer feedback (e.g., multi‑dimensional preference vectors) might require more sophisticated fusion techniques.
  • Broader tasks: The study focused on Q/A; extending to generation, summarization, or code synthesis could reveal task‑specific dynamics.

Bottom line: By systematically evaluating how we merge human preferences in federated RLHF, the authors provide both a diagnostic toolkit and a practical adaptive aggregator that helps developers build LLMs that are not just powerful, but also fair across the diverse users they serve.

Authors

  • Mahmoud Srewa
  • Tianyu Zhao
  • Salma Elmalaki

Paper Information

  • arXiv ID: 2512.08786v1
  • Categories: cs.CL, cs.AI
  • Published: December 9, 2025
  • PDF: https://arxiv.org/pdf/2512.08786v1