[Paper] A Systematic Evaluation of Preference Aggregation in Federated RLHF for Pluralistic Alignment of LLMs
Source: arXiv - 2512.08786v1
Overview
Large language models (LLMs) are increasingly being fine‑tuned with human feedback (RLHF) to make them behave responsibly. When that feedback comes from many different user groups—think of a federated setting where each organization or community trains locally—the usual “average‑the‑rewards” approach can drown out minority viewpoints. This paper proposes a systematic way to evaluate how we should combine those disparate preference signals, and it introduces an adaptive aggregation scheme that balances alignment quality with fairness across groups.
Key Contributions
- Evaluation framework for measuring the trade‑off between alignment performance and fairness of reward aggregation in federated RLHF.
- Comprehensive benchmark on question‑answering tasks using a PPO‑based RLHF pipeline, covering three classic aggregators (min, max, average).
- Novel adaptive aggregation algorithm that re‑weights each group’s reward signal based on its historical alignment success, without ever transmitting raw data.
- Empirical evidence that the adaptive method improves fairness (more equitable performance across groups) while keeping overall alignment scores on par with the best static baselines.
- Open‑source reference implementation (code and scripts) to help practitioners reproduce and extend the experiments.
Methodology
- Federated RLHF setup – Each participating group (e.g., a company, a regional user cohort) runs a local RLHF loop: it samples model rollouts, collects human preference judgments, and computes a scalar reward signal. No raw text or user data leaves the group.
- Reward aggregation strategies – The central server receives only the per‑group reward values and combines them using one of the following rules (a code sketch follows this list):
- Min (worst‑case),
- Max (best‑case),
- Average (standard), and
- Adaptive (proposed): a moving‑average weight for each group that grows when that group’s rewards lead to higher downstream alignment metrics.
- Training pipeline – The aggregated reward drives a PPO (Proximal Policy Optimization) update on the global LLM. The process repeats for several federated rounds.
- Metrics –
- Alignment score: standard RLHF evaluation (e.g., win‑rate against a reference model on Q/A).
- Fairness index: dispersion of alignment scores across groups, reported as the standard deviation of per‑group scores (lower = higher fairness).
- Experimental protocol – Three heterogeneous user groups with distinct preference distributions were simulated. Each experiment ran multiple random seeds to ensure statistical reliability.
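To make these rules concrete, here is a minimal NumPy sketch of the server‑side aggregation step and an illustrative moving‑average weight update. The exact adaptive update rule and the `beta` smoothing factor are assumptions for illustration; the paper only states that a group's weight grows with its historical alignment success, and the fairness helper mirrors the standard‑deviation metric used in the results table.

```python
# Sketch of the four server-side aggregation rules (min, max, average, adaptive).
# The adaptive weight update and `beta` are illustrative assumptions, not the
# paper's exact formula.
import numpy as np

def aggregate(rewards, strategy, weights=None):
    """Combine per-group scalar rewards into a single training reward."""
    r = np.asarray(rewards, dtype=float)
    if strategy == "min":       # worst-case group
        return float(r.min())
    if strategy == "max":       # best-case group
        return float(r.max())
    if strategy == "average":   # standard federated mean
        return float(r.mean())
    if strategy == "adaptive":  # proposed: weighted by historical alignment success
        w = np.asarray(weights, dtype=float)
        return float(np.dot(w, r) / w.sum())
    raise ValueError(f"unknown strategy: {strategy}")

def update_weights(weights, group_alignment_scores, beta=0.9):
    """Moving-average re-weighting: groups whose feedback led to higher
    downstream alignment gain influence in later rounds (illustrative form)."""
    w = beta * np.asarray(weights, dtype=float) \
        + (1.0 - beta) * np.asarray(group_alignment_scores, dtype=float)
    return w / w.sum()

def fairness_std(group_alignment_scores):
    """Fairness index in the paper's sense: standard deviation of per-group
    alignment scores (lower = more equitable)."""
    return float(np.std(group_alignment_scores))
```

In each federated round the server would call `aggregate(...)` on the reported rewards, run the PPO update, and then refresh the weights from the latest per‑group alignment scores.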
Results & Findings
| Aggregator | Avg. Alignment Score ↑ | Fairness (Std. Dev.) ↓ |
|---|---|---|
| Min | 71.2 % | 4.1 % |
| Max | 78.9 % | 9.8 % |
| Average | 77.4 % | 6.3 % |
| Adaptive | 77.1 % | 3.2 % |
- The adaptive scheme keeps raw alignment on par with the average aggregator and within roughly two points of max (the strongest static baseline), while roughly halving the fairness disparity relative to average.
- Across all runs, the adaptive method consistently kept the worst‑performing group within 2 % of the best‑performing group, a notable improvement over the min and max baselines.
- Ablation studies show that the benefit stems from the dynamic weighting rather than simply smoothing; fixing the weights early on erodes the fairness gains.
Practical Implications
- Product teams can deploy RLHF pipelines that respect regional or demographic differences without centralizing sensitive feedback data—critical for GDPR‑compliant AI services.
- Marketplace AI platforms (e.g., code assistants, chatbots) can maintain a more consistent baseline quality for all partner developers, reducing the risk of “model bias” complaints from minority user bases.
- Open‑source model maintainers gain a ready‑to‑use recipe for federated fine‑tuning that automatically balances performance and equity, lowering the engineering overhead of custom weighting schemes.
- The adaptive aggregator can be plugged into existing PPO‑based RLHF libraries (e.g., Hugging Face's 🤗 TRL) with minimal code changes: replace the reward‑averaging step with the provided weighting logic, as sketched below.
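A minimal, hypothetical sketch of where that swap happens in a federated round, reusing `aggregate` and `update_weights` from the Methodology sketch above. `collect_preference_reward`, `ppo_update`, and `evaluate_alignment` are placeholders for whatever RLHF stack you run; the paper does not prescribe these interfaces, and only the aggregation line is specific to its proposal.

```python
# Hypothetical federated round loop. collect_preference_reward, ppo_update, and
# evaluate_alignment are placeholders for your own RLHF pipeline (e.g., a PPO
# trainer step); only the aggregation step changes relative to plain averaging.
import numpy as np

def federated_rlhf(model, groups, num_rounds=10, beta=0.9):
    weights = np.ones(len(groups)) / len(groups)   # start from uniform weights
    for _ in range(num_rounds):
        # 1. Each group runs its local RLHF loop and reports only a scalar reward.
        group_rewards = [collect_preference_reward(model, g) for g in groups]

        # 2. Server combines per-group rewards (the swapped-in step).
        reward = aggregate(group_rewards, "adaptive", weights)

        # 3. The aggregated reward drives a PPO update on the global model
        #    (schematic; a real PPO step also needs the rollouts themselves).
        model = ppo_update(model, reward)

        # 4. Weights drift toward groups whose feedback improved alignment.
        scores = [evaluate_alignment(model, g) for g in groups]
        weights = update_weights(weights, scores, beta=beta)
    return model
```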
Limitations & Future Work
- Synthetic groups: The experiments used simulated preference distributions; real‑world federated deployments may exhibit more complex, non‑stationary behavior.
- Scalability: Weight updates are computed centrally; scaling to thousands of clients could introduce latency—future work could explore decentralized or hierarchical weighting.
- Reward granularity: Only scalar rewards were aggregated; richer feedback (e.g., multi‑dimensional preference vectors) might require more sophisticated fusion techniques.
- Broader tasks: The study focused on Q/A; extending to generation, summarization, or code synthesis could reveal task‑specific dynamics.
Bottom line: By systematically evaluating how we merge human preferences in federated RLHF, the authors provide both a diagnostic toolkit and a practical adaptive aggregator that helps developers build LLMs that are not just powerful, but also fair across the diverse users they serve.
Authors
- Mahmoud Srewa
- Tianyu Zhao
- Salma Elmalaki
Paper Information
- arXiv ID: 2512.08786v1
- Categories: cs.CL, cs.AI
- Published: December 9, 2025