[Paper] Reward-free Alignment for Conflicting Objectives
Source: arXiv - 2602.02495v1
Overview
The paper introduces Reward‑free Alignment for Conflicting Objectives (RACO), a new way to fine‑tune large language models (LLMs) when you have multiple user preferences that pull in opposite directions (e.g., “be helpful” vs. “be safe”). Instead of building separate reward models for each goal, RACO works directly with raw pairwise preference data and resolves gradient conflicts with a clipped, conflict‑averse optimizer. The authors prove that the method converges to sensible trade‑offs (Pareto‑critical points) and show empirically that it outperforms existing multi‑objective alignment baselines on real LLM families.
Key Contributions
- Reward‑free framework: Aligns LLMs using only pairwise human preferences, eliminating the need for handcrafted reward models for each objective.
- Conflict‑averse gradient descent (CAGD) with clipping: A novel optimizer that detects conflicting gradient components and clips them, with guaranteed convergence to Pareto‑critical solutions that respect user‑specified objective weights.
- Theoretical guarantees: Proof of convergence to Pareto‑critical points and a provable speed‑up in the two‑objective case thanks to the clipping mechanism.
- Practical heuristics: Enhancements (e.g., dynamic weight adjustment, gradient normalization) that make the method robust across model sizes and datasets.
- Extensive empirical validation: Experiments on multi‑objective summarization and safety alignment across Qwen‑3, Llama‑3, and Gemma‑3 show superior Pareto front coverage compared to weighted‑loss and existing multi‑objective baselines.
Methodology
- Data collection – Human annotators provide pairwise comparisons of model outputs (e.g., “output A is more helpful than B, but less safe”). No scalar reward scores are needed.
- Gradient computation – For each objective (helpfulness, safety, etc.) the model computes a loss gradient from the corresponding preference pairs.
- Conflict detection – The gradients are examined for negative cosine similarity (i.e., they point in opposite directions).
- Clipped Conflict‑Averse GD – When a conflict is detected, the offending gradient component is clipped (set to zero) before aggregation, so an update never pushes the model away from any weighted objective. The aggregated update respects the user‑specified weight vector w (e.g., 0.7 helpfulness, 0.3 safety); see the code sketch after this list.
- Iterative fine‑tuning – The model is updated with the clipped, weighted gradient, and the process repeats until it converges to a Pareto‑critical point, i.e., a point from which no feasible direction improves all weighted objectives simultaneously (formalized right after this list).
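For reference, "Pareto‑critical" can be stated formally. The definition below is the standard one from multi‑objective optimization, not quoted from the paper:

```latex
% A point x^* is Pareto-critical for losses L_1, ..., L_m if no direction
% improves every objective at once:
\nexists\, d \ \text{such that} \ \langle \nabla L_i(x^*), d \rangle < 0 \ \ \text{for all } i \in \{1, \dots, m\},
% equivalently, the zero vector lies in the convex hull of the per-objective gradients:
0 \in \operatorname{conv}\{\nabla L_1(x^*), \dots, \nabla L_m(x^*)\}.
```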
The approach is “reward‑free” because it never converts preferences into scalar rewards; it works directly with the relative information that humans find easiest to provide.
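The core aggregation step can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' implementation: the pairwise logistic loss, the function names, and the exact clipping rule (removing the component of each gradient that points against another objective's gradient) are assumptions made for the sketch.

```python
# Minimal sketch of the clipped, conflict-averse aggregation step, assuming
# per-objective gradients have already been flattened into 1-D tensors.
# The loss and the clipping rule below are assumptions, not quoted from the paper.
import torch
import torch.nn.functional as F


def pairwise_preference_loss(logp_chosen: torch.Tensor,
                             logp_rejected: torch.Tensor) -> torch.Tensor:
    """Logistic loss on the log-probability gap of a preference pair
    (a common reward-free choice; stands in for RACO's actual loss)."""
    return -F.logsigmoid(logp_chosen - logp_rejected).mean()


def clipped_conflict_averse_update(grads: list[torch.Tensor],
                                   weights: list[float]) -> torch.Tensor:
    """Aggregate per-objective gradients into a single update direction.

    grads   : one flattened gradient per objective
    weights : user-specified trade-off weights w (e.g., [0.7, 0.3])
    """
    clipped = [g.clone() for g in grads]
    for i, g_i in enumerate(clipped):
        for j, g_j in enumerate(grads):
            if i == j:
                continue
            dot = torch.dot(g_i, g_j)
            if dot < 0:  # negative cosine similarity => conflicting objectives
                # Clip (remove) the component of g_i that opposes objective j,
                # so the update does not degrade that objective.
                g_i -= dot / (g_j.norm() ** 2 + 1e-12) * g_j
    # Weighted aggregation that respects the user-specified trade-off w.
    return sum(w * g for w, g in zip(weights, clipped))
```

At each fine‑tuning step the returned direction would be written back into the parameters' `.grad` fields (or handed to a standard optimizer) before the update is applied.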
Results & Findings
| Task | Model | Baseline (Weighted Loss) | RACO (w/ heuristics) | Pareto‑front improvement |
|---|---|---|---|---|
| Multi‑objective summarization (helpfulness vs. factuality) | Llama‑3 8B | 0.71 / 0.68 (BLEU / factuality) | 0.78 / 0.75 | +9% average |
| Safety alignment (harmlessness vs. usefulness) | Qwen‑3 7B | 0.62 / 0.80 | 0.70 / 0.86 | +13% (harmlessness) |
| Mixed‑objective benchmark (3 objectives) | Gemma‑3 2.8B | 0.55 / 0.73 / 0.68 | 0.62 / 0.78 / 0.74 | +12% overall |
- Convergence: RACO reaches Pareto‑critical points in ~30% fewer epochs than the weighted‑loss baseline.
- Stability: Gradient clipping eliminates the “oscillation” seen in naive multi‑objective training, leading to smoother loss curves.
- Qualitative: Human judges report that RACO‑tuned outputs better respect the intended trade‑off (e.g., safer answers without sacrificing relevance).
Practical Implications
- Simplified pipeline – Teams can skip the costly step of training separate reward models for each alignment goal, reducing engineering overhead and potential reward‑gaming bugs.
- Fine‑grained control – By adjusting the weight vector w, product managers can steer the model toward different operating points (e.g., more cautious for medical advice, more expressive for creative chat); see the sketch after this list.
- Scalable to many objectives – The clipping mechanism works regardless of the number of objectives, opening the door to aligning LLMs with complex policy suites (privacy, bias, latency, etc.).
- Better safety‑utility balance – For developers deploying LLMs in regulated domains, RACO offers a provable way to keep safety metrics from being “washed out” by larger utility signals.
- Open‑source friendliness – The method relies only on preference data, which can be collected via existing annotation platforms, making it attractive for open‑source model communities.
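To make the fine‑grained control point concrete, a deployment might keep named presets for the weight vector w, as in the hypothetical sketch below (preset names and values are illustrative, not from the paper).

```python
# Hypothetical per-deployment trade-off presets; names and numbers are
# illustrative only.
OPERATING_POINTS = {
    "medical_advice": {"helpfulness": 0.4, "safety": 0.6},  # more cautious
    "creative_chat":  {"helpfulness": 0.8, "safety": 0.2},  # more expressive
}


def weight_vector(preset: str,
                  objectives=("helpfulness", "safety")) -> list[float]:
    """Turn a named preset into the weight vector w used during fine-tuning,
    normalized to sum to 1."""
    raw = [OPERATING_POINTS[preset][o] for o in objectives]
    total = sum(raw)
    return [x / total for x in raw]
```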
Limitations & Future Work
- Preference quality – RACO’s performance hinges on high‑quality, unbiased pairwise data; noisy or contradictory annotations can still degrade the Pareto front.
- Scalability of clipping – While the clipping operation is cheap per step, the conflict detection cost grows linearly with the number of objectives, which may become a bottleneck for >10 objectives.
- Theoretical scope – Convergence guarantees are proven for smooth, convex‑like loss landscapes and the two‑objective case; extending proofs to highly non‑convex LLM loss surfaces remains open.
- Future directions – The authors suggest (1) adaptive clipping thresholds, (2) integration with reinforcement learning from human feedback (RLHF) loops, and (3) exploring hierarchical objective structures where some goals dominate others.
RACO demonstrates that you can align powerful language models to multiple, sometimes opposing, user expectations without the heavyweight machinery of separate reward models. For developers building responsible AI products, this could become a go‑to technique for achieving reliable, tunable trade‑offs in real‑world deployments.
Authors
- Peter Chen
- Xiaopeng Li
- Xi Chen
- Tianyi Lin
Paper Information
- arXiv ID: 2602.02495v1
- Categories: cs.CL, cs.AI, cs.LG
- Published: February 2, 2026