[Paper] Uncovering Cross-Objective Interference in Multi-Objective Alignment
Source: arXiv - 2602.06869v1
Overview
Large language models (LLMs) are increasingly trained to satisfy multiple alignment objectives—such as helpfulness, harmlessness, and factuality—by scalarizing them into a single training signal. Lu and Jiang uncover a systematic failure mode: improving one objective can unintentionally degrade the others, a phenomenon they name cross‑objective interference. Their work not only explains why this happens but also offers a lightweight fix that can be dropped into existing alignment pipelines.
Key Contributions
- Formal definition of cross‑objective interference and a taxonomy of how it manifests across popular scalarization methods (linear weighting, Pareto‑based, etc.).
- Local covariance analysis showing that an objective’s first‑order improvement is tied to the positive covariance between its reward and the scalarized training signal.
- Extension of the covariance law to clipped surrogate objectives (e.g., PPO‑style clipping), proving the law still holds under mild assumptions.
- Covariance Targeted Weight Adaptation (CTWA): a plug‑and‑play algorithm that dynamically re‑weights objectives to preserve positive covariance during training.
- Global convergence guarantees under the Polyak‑Łojasiewicz (PL) condition, linking interference severity to model geometry (e.g., curvature of the loss landscape).
- Extensive empirical study across several LLM sizes and alignment setups, demonstrating that interference is pervasive and model‑dependent, while CTWA consistently reduces it.
Methodology
Problem Formalization
- Treat each alignment goal as a separate reward function (r_i(\theta)).
- Scalarized training uses a weighted sum (L(\theta)=\sum_i w_i r_i(\theta)) (or clipped surrogate variants).
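The scalarized objective above can be sketched in a few lines; the batch of per-sample rewards and the weight values here are illustrative placeholders, not figures from the paper.

```python
import numpy as np

# Hypothetical per-sample rewards for three alignment objectives
# (helpfulness, harmlessness, factuality); shape: (batch, n_objectives).
rewards = np.array([
    [0.9, 0.2, 0.7],
    [0.4, 0.8, 0.6],
    [0.6, 0.5, 0.9],
])

weights = np.array([0.5, 0.3, 0.2])  # static scalarization weights w_i

# Scalarized training signal L = sum_i w_i * r_i, computed per sample.
scalarized = rewards @ weights
print(scalarized)
```

With static weights like these, every objective shares one training signal, which is exactly what opens the door to cross-objective interference.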
Local Covariance Law
- Derive the first‑order change of each objective after a gradient step with learning rate (\eta):
[ \Delta r_i \approx \eta\,\text{Cov}(r_i, L) ]
- Positive covariance ⇒ expected improvement; negative covariance ⇒ interference.
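The covariance diagnostic is cheap to estimate on a mini-batch. The sketch below (toy data, not the paper's experiments) computes Cov(r_i, L) per objective and flags interference when a value goes negative.

```python
import numpy as np

def objective_covariances(rewards: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Per-objective covariance Cov(r_i, L) over a mini-batch.

    rewards: (batch, n_objectives) per-sample rewards.
    weights: (n_objectives,) scalarization weights.
    Returns one covariance per objective; a negative entry signals that
    a gradient step on L is expected to *degrade* objective i.
    """
    scalarized = rewards @ weights                 # L per sample
    centered_r = rewards - rewards.mean(axis=0)    # center each r_i
    centered_l = scalarized - scalarized.mean()    # center L
    return centered_r.T @ centered_l / rewards.shape[0]

# Toy batch in which objective 2 is anti-correlated with the others.
rewards = np.array([[1.0, 0.9, 0.1],
                    [0.2, 0.3, 0.9],
                    [0.8, 0.7, 0.2]])
weights = np.array([0.6, 0.3, 0.1])
cov = objective_covariances(rewards, weights)
print(cov)  # the third objective's covariance is negative → interference
```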
Clipping Extension
- Show that when using PPO‑style clipping, the covariance term survives as long as the clipping threshold does not truncate the majority of the gradient signal.
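For intuition on why clipping matters, here is the standard PPO-style clipped surrogate (textbook form, not code from the paper): samples whose importance ratio falls outside the clipping band contribute zero gradient, so if too many samples are clipped the covariance signal would be truncated.

```python
import numpy as np

def clipped_surrogate(ratio: np.ndarray, advantage: np.ndarray, eps: float = 0.2) -> np.ndarray:
    """PPO-style clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).

    When the ratio exits [1-eps, 1+eps] on the side favored by the
    advantage, that sample's gradient is zeroed out; the paper's
    covariance law holds as long as most samples stay unclipped.
    """
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped)

ratios = np.array([0.95, 1.05, 1.5])     # hypothetical importance ratios
advantages = np.array([1.0, 1.0, 1.0])   # positive advantages
print(clipped_surrogate(ratios, advantages))  # third sample clipped at 1.2
```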
CTWA Algorithm
- At each training iteration, estimate (\text{Cov}(r_i, L)) on a mini‑batch.
- Adjust weights (w_i) proportionally to keep all covariances non‑negative (e.g., increase weight for objectives with low/negative covariance, decrease otherwise).
- No extra forward passes; weight updates are cheap and can be applied to any existing scalarization pipeline.
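The three steps above can be sketched as a single function. This is a plausible reading of the algorithm under the stated constraints (keep covariances non-negative, cheap per-batch update), not the authors' exact update rule; the learning rate, floor, and renormalization are assumptions.

```python
import numpy as np

def ctwa_step(weights: np.ndarray, rewards: np.ndarray,
              lr: float = 0.1, floor: float = 1e-3):
    """One CTWA-style iteration (a sketch of the idea): estimate
    Cov(r_i, L) on the mini-batch, raise the weight of any objective
    with negative covariance, then renormalize."""
    scalarized = rewards @ weights                 # L per sample
    centered_r = rewards - rewards.mean(axis=0)
    centered_l = scalarized - scalarized.mean()
    cov = centered_r.T @ centered_l / rewards.shape[0]
    # Boost objectives currently being harmed; keep every weight positive.
    weights = np.maximum(weights + lr * np.maximum(-cov, 0.0), floor)
    return weights / weights.sum(), cov

# Toy mini-batch: objective 2 is anti-correlated with the scalarized signal.
batch_rewards = np.array([[1.0, 0.9, 0.1],
                          [0.2, 0.3, 0.9],
                          [0.8, 0.7, 0.2]])
weights = np.array([0.6, 0.3, 0.1])
new_weights, cov = ctwa_step(weights, batch_rewards)
print(new_weights)  # weight of the interfered objective rises
```

Note that everything here reuses quantities already computed during training (per-sample rewards), which is consistent with the claim that no extra forward passes are needed.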
Theoretical Guarantees
- Under the PL condition (a relaxed form of strong convexity common in deep nets), prove that the scalarized loss converges globally.
- Derive how the convergence rate depends on the spectral properties of the Jacobian of the reward vector, linking model geometry to interference magnitude.
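For reference, the PL condition invoked above has the standard form below ((\mu) and (\beta) are the usual PL and smoothness constants, not symbols taken from the paper):

```latex
% Polyak–Łojasiewicz (PL) condition on the scalarized loss L:
% the squared gradient norm dominates the suboptimality gap.
\frac{1}{2}\,\lVert \nabla L(\theta) \rVert^{2} \;\ge\; \mu \,\bigl( L(\theta) - L^{*} \bigr)
\qquad \text{for some } \mu > 0.
% Under gradient descent with step size \eta \le 1/\beta on a \beta-smooth L,
% this yields the classical linear rate:
% L(\theta_t) - L^{*} \;\le\; (1 - \eta\mu)^{t}\,\bigl( L(\theta_0) - L^{*} \bigr).
```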
Empirical Evaluation
- Benchmarks on LLaMA‑7B, LLaMA‑13B, and a 70B instruction‑tuned model.
- Objectives: helpfulness (human preference), harmlessness (toxicity filter), factuality (ground‑truth QA).
- Metrics: per‑objective reward improvement, overall win‑rate, and a newly introduced Interference Index (average negative covariance).
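One plausible implementation of the Interference Index, read directly from its description as "average negative covariance" (the exact aggregation used in the paper may differ):

```python
import numpy as np

def interference_index(cov_history: np.ndarray) -> float:
    """Hypothetical Interference Index: the average magnitude of
    *negative* per-objective covariances across logged training steps.
    Lower is better; 0 means no interference was observed.

    cov_history: (steps, n_objectives) array of Cov(r_i, L) estimates.
    """
    negative = np.minimum(cov_history, 0.0)  # keep only negative entries
    return float(-negative.mean())

covs = np.array([[ 0.05, -0.02,  0.01],
                 [ 0.03,  0.04, -0.06]])
print(interference_index(covs))
```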
Results & Findings
| Model | Baseline rewards (helpful / harmless / factual) | CTWA rewards (helpful / harmless / factual) | Interference Index ↓ | Avg. per‑objective gain |
|---|---|---|---|---|
| LLaMA‑7B | 0.71 / 0.64 / 0.58 | 0.78 / 0.71 / 0.66 | 0.12 → 0.04 | +7 % helpful, +9 % harmless, +8 % factual |
| LLaMA‑13B | 0.74 / 0.68 / 0.62 | 0.80 / 0.74 / 0.70 | 0.15 → 0.05 | +6 % / +9 % / +9 % |
| 70B | 0.78 / 0.73 / 0.68 | 0.83 / 0.78 / 0.74 | 0.18 → 0.06 | +5 % / +7 % / +9 % |
- Cross‑objective interference is ubiquitous: even with carefully tuned static weights, at least one objective degrades in >30 % of training steps.
- CTWA eliminates most negative covariance while preserving overall training speed (≤ 3 % extra compute).
- Convergence analysis matches practice: models that satisfy the PL‑like condition (larger, smoother loss surfaces) show faster reduction of interference.
Practical Implications
- Plug‑and‑play for existing pipelines – CTWA can be added to any RLHF or supervised fine‑tuning loop that uses scalarized rewards, requiring only a covariance estimate per batch.
- More reliable multi‑objective alignment – developers can now safely add new objectives (e.g., privacy, energy efficiency) without fearing hidden regressions.
- Better debugging tools – the covariance metric gives a quantitative “interference heatmap” that highlights which objectives are at odds, guiding data collection or reward redesign.
- Potential cost savings – by avoiding repeated re‑training cycles to rebalance static weights, teams can converge to a balanced model faster.
- Framework integration – the authors released a lightweight PyTorch wrapper; early adopters can integrate it with Hugging Face Trainer, DeepSpeed, or custom RLHF loops.
Limitations & Future Work
- Covariance estimation noise: on very small batches the covariance signal can be noisy, leading to occasional over‑adjustment of weights.
- Assumption of PL‑like landscape: the global convergence proof hinges on a Polyak‑Łojasiewicz condition that may not hold for highly non‑convex fine‑tuning regimes (e.g., instruction‑following with massive prompt diversity).
- Scalability to dozens of objectives: the current formulation scales linearly with the number of objectives; future work could explore low‑rank approximations or hierarchical weighting.
- Interaction with adversarial training: how CTWA behaves when some objectives are adversarially defined (e.g., robustness) remains open.
The authors suggest extending the covariance framework to meta‑learning of objective weights and exploring second‑order geometric insights that could further reduce interference in ultra‑large models.
Authors
- Yining Lu
- Meng Jiang
Paper Information
- arXiv ID: 2602.06869v1
- Categories: cs.CL, cs.LG
- Published: February 6, 2026