[Paper] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
Source: arXiv - 2512.08786v1
Overview
Large language models (LLMs) are increasingly being fine‑tuned with human feedback (RLHF) to make them behave responsibly. When that feedback comes from many different user groups—think of a federated setting where each organization or community trains locally—the usual “average‑the‑rewards” approach can drown out minority viewpoints. This paper proposes a systematic way to evaluate how we should combine those disparate preference signals, and it introduces an adaptive aggregation scheme that balances alignment quality with fairness across groups.
Key Contributions
- Evaluation framework for measuring the trade‑off between alignment performance and fairness of reward aggregation in federated RLHF.
- Comprehensive benchmark on question‑answering tasks using a PPO‑based RLHF pipeline, covering three classic aggregators (min, max, average).
- Novel adaptive aggregation algorithm that re‑weights each group’s reward signal based on its historical alignment success, without ever transmitting raw data.
- Empirical evidence that the adaptive method improves fairness (more equitable performance across groups) while keeping overall alignment scores on par with the best static baselines.
- Open‑source reference implementation (code and scripts) to help practitioners reproduce and extend the experiments.
Methodology
- Federated RLHF setup – Each participating group (e.g., a company, a regional user cohort) runs a local RLHF loop: it samples model rollouts, collects human preference judgments, and computes a scalar reward signal. No raw text or user data leaves the group.
- Reward aggregation strategies – The central server receives only the per‑group reward values and combines them using one of the following rules (a code sketch follows this list):
- Min (worst‑case),
- Max (best‑case),
- Average (standard), and
- Adaptive (proposed): a moving‑average weight for each group that grows when that group’s rewards lead to higher downstream alignment metrics.
- Training pipeline – The aggregated reward drives a PPO (Proximal Policy Optimization) update on the global LLM. The process repeats for several federated rounds.
- Metrics –
- Alignment score: standard RLHF evaluation (e.g., win‑rate against a reference model on Q/A).
- Fairness index: dispersion of alignment scores across groups, reported as the standard deviation of per‑group scores (lower = higher fairness).
- Experimental protocol – Three heterogeneous user groups with distinct preference distributions were simulated. Each experiment ran multiple random seeds to ensure statistical reliability.
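To make these rules concrete, here is a minimal NumPy sketch of the server‑side aggregation step and an illustrative moving‑average weight update. The exact adaptive update rule and the `beta` smoothing factor are assumptions for illustration; the paper only states that a group's weight grows with its historical alignment success, and the fairness helper mirrors the standard‑deviation metric used in the results table.

```python
# Sketch of the four server-side aggregation rules (min, max, average, adaptive).
# The adaptive weight update and `beta` are illustrative assumptions, not the
# paper's exact formula.
import numpy as np

def aggregate(rewards, strategy, weights=None):
    """Combine per-group scalar rewards into a single training reward."""
    r = np.asarray(rewards, dtype=float)
    if strategy == "min":       # worst-case group
        return float(r.min())
    if strategy == "max":       # best-case group
        return float(r.max())
    if strategy == "average":   # standard federated mean
        return float(r.mean())
    if strategy == "adaptive":  # proposed: weighted by historical alignment success
        w = np.asarray(weights, dtype=float)
        return float(np.dot(w, r) / w.sum())
    raise ValueError(f"unknown strategy: {strategy}")

def update_weights(weights, group_alignment_scores, beta=0.9):
    """Moving-average re-weighting: groups whose feedback led to higher
    downstream alignment gain influence in later rounds (illustrative form)."""
    w = beta * np.asarray(weights, dtype=float) \
        + (1.0 - beta) * np.asarray(group_alignment_scores, dtype=float)
    return w / w.sum()

def fairness_std(group_alignment_scores):
    """Fairness index in the paper's sense: standard deviation of per-group
    alignment scores (lower = more equitable)."""
    return float(np.std(group_alignment_scores))
```

In each federated round the server would call `aggregate(...)` on the reported rewards, run the PPO update, and then refresh the weights from the latest per‑group alignment scores.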
Results & Findings
| Aggregator | Avg. Alignment Score ↑ | Fairness (Std. Dev.) ↓ |
|---|---|---|
| Min | 71.2 % | 4.1 % |
| Max | 78.9 % | 9.8 % |
| Average | 77.4 % | 6.3 % |
| Adaptive | 77.1 % | 3.2 % |
- The adaptive scheme keeps raw alignment on par with the average aggregator and within roughly two points of max (the strongest static baseline), while roughly halving the fairness disparity relative to average.
- Across all runs, the adaptive method consistently kept the worst‑performing group within 2 % of the best‑performing group, a notable improvement over the min and max baselines.
- Ablation studies show that the benefit stems from the dynamic weighting rather than simply smoothing; fixing the weights early on erodes the fairness gains.
Practical Implications
- Product teams can deploy RLHF pipelines that respect regional or demographic differences without centralizing sensitive feedback data—critical for GDPR‑compliant AI services.
- Marketplace AI platforms (e.g., code assistants, chatbots) can maintain a more consistent baseline quality for all partner developers, reducing the risk of “model bias” complaints from minority user bases.
- Open‑source model maintainers gain a ready‑to‑use recipe for federated fine‑tuning that automatically balances performance and equity, lowering the engineering overhead of custom weighting schemes.
- The adaptive aggregator can be plugged into existing PPO‑based RLHF libraries (e.g., Hugging Face's 🤗 TRL) with minimal code changes: replace the reward‑averaging step with the provided weighting logic, as sketched below.
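A minimal, hypothetical sketch of where that swap happens in a federated round, reusing `aggregate` and `update_weights` from the Methodology sketch above. `collect_preference_reward`, `ppo_update`, and `evaluate_alignment` are placeholders for whatever RLHF stack you run; the paper does not prescribe these interfaces, and only the aggregation line is specific to its proposal.

```python
# Hypothetical federated round loop. collect_preference_reward, ppo_update, and
# evaluate_alignment are placeholders for your own RLHF pipeline (e.g., a PPO
# trainer step); only the aggregation step changes relative to plain averaging.
import numpy as np

def federated_rlhf(model, groups, num_rounds=10, beta=0.9):
    weights = np.ones(len(groups)) / len(groups)   # start from uniform weights
    for _ in range(num_rounds):
        # 1. Each group runs its local RLHF loop and reports only a scalar reward.
        group_rewards = [collect_preference_reward(model, g) for g in groups]

        # 2. Server combines per-group rewards (the swapped-in step).
        reward = aggregate(group_rewards, "adaptive", weights)

        # 3. The aggregated reward drives a PPO update on the global model
        #    (schematic; a real PPO step also needs the rollouts themselves).
        model = ppo_update(model, reward)

        # 4. Weights drift toward groups whose feedback improved alignment.
        scores = [evaluate_alignment(model, g) for g in groups]
        weights = update_weights(weights, scores, beta=beta)
    return model
```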
Limitations & Future Work
- Synthetic groups: The experiments used simulated preference distributions; real‑world federated deployments may exhibit more complex, non‑stationary behavior.
- Scalability: Weight updates are computed centrally; scaling to thousands of clients could introduce latency—future work could explore decentralized or hierarchical weighting.
- Reward granularity: Only scalar rewards were aggregated; richer feedback (e.g., multi‑dimensional preference vectors) might require more sophisticated fusion techniques.
- Broader tasks: The study focused on Q/A; extending to generation, summarization, or code synthesis could reveal task‑specific dynamics.
Bottom line: By systematically evaluating how we merge human preferences in federated RLHF, the authors provide both a diagnostic toolkit and a practical adaptive aggregator that helps developers build LLMs that are not just powerful, but also fair across the diverse users they serve.
Authors
- Mahmoud Srewa
- Tianyu Zhao
- Salma Elmalaki
Paper Information
- arXiv ID: 2512.08786v1
- Categories: cs.CL, cs.AI
- Published: December 9, 2025