[Paper] Provable Last-Iterate Convergence for Multi-Objective Safe LLM Alignment via Optimistic Primal-Dual

Published: February 25, 2026
Source: arXiv (2602.22146v1)

Overview

This paper tackles a core challenge in aligning large language models (LLMs) with human values: how to reliably train them under safety constraints using Reinforcement Learning from Human Feedback (RLHF). The authors introduce a new optimistic primal‑dual (OPD) algorithm that provably converges in the last iterate—the actual model you deploy—bridging the gap between elegant theory and the messy reality of parameterized neural‑network policies.

Key Contributions

  • Unified primal‑dual framework that subsumes most existing constrained‑alignment methods (single‑shot, multi‑shot, and “Safe RLHF” variants) as special cases.
  • Optimistic primal‑dual (OPD) algorithm that adds predictive (look‑ahead) updates for both policy (primal) and constraint (dual) variables, damping the oscillations typical of constrained RL.
  • Last‑iterate convergence guarantees for:
    1. Exact policy optimization in the distributional (non‑parameterized) space.
    2. Parameterized policies, showing convergence to a small neighborhood whose radius depends on approximation and bias errors.
  • Theoretical insight that optimism—common in online learning—acts as a stabilizer for constrained alignment objectives, a missing piece in prior RLHF theory.
  • Broad applicability: the analysis works for any convex‑concave saddle‑point formulation of safe RLHF, making it a “plug‑and‑play” upgrade for many existing pipelines.
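
The convex‑concave saddle‑point formulation mentioned above can be written out concretely. The notation below is our own shorthand (the paper's symbols may differ): r is the reward model, c the safety cost, b the safety budget, and λ the Lagrange multiplier.

```latex
% Safe RLHF as constrained optimization (notation ours):
\max_{\pi} \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ r(x, y) \right]
\quad \text{s.t.} \quad
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\!\left[ c(x, y) \right] \le b

% Its Lagrangian saddle-point problem, with multiplier \lambda \ge 0:
\max_{\pi} \; \min_{\lambda \ge 0} \;
  L(\pi, \lambda)
  = \mathbb{E}\!\left[ r(x, y) \right]
    - \lambda \left( \mathbb{E}\!\left[ c(x, y) \right] - b \right)
```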

Methodology

  1. Problem formulation – The safe RLHF task is cast as a constrained optimization: maximize expected human‑feedback reward while keeping a safety‑related cost below a threshold. This yields a Lagrangian saddle‑point problem with primal variables (the policy) and dual variables (Lagrange multipliers).
  2. Optimistic updates – Instead of the classic primal‑dual gradient steps, OPD first predicts the next primal and dual points using the current gradients, then evaluates the gradients at these predicted points to perform the actual update. This “extra look‑ahead” reduces the tendency of the iterates to chase each other in circles.
  3. Analysis pipeline
    • For the distributional case, the authors prove that the OPD iterates converge linearly to the exact saddle point.
    • For parameterized policies (e.g., neural networks), they bound the error introduced by function approximation and show that the iterates converge to a neighborhood whose size scales with these errors.
  4. Unification – By expressing existing safe‑RLHF algorithms as special choices of step‑sizes and update rules within the same primal‑dual template, the paper demonstrates that OPD can replace them without redesigning the whole training loop.
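
Steps 1–2 can be sketched on a one‑dimensional toy problem of our own (not from the paper). The "past‑gradient" form of optimism used below, 2·(current gradient) − (previous gradient), is one standard instantiation of the predictive update:

```python
# Toy constrained problem (illustrative only):
#   maximize f(x) = -(x - 1)^2   subject to   x <= 0.5
# Lagrangian: L(x, lam) = f(x) - lam * (x - 0.5)
# Saddle point: x* = 0.5 (constraint active), lam* = 1.

def grad_x(x, lam):
    return -2.0 * (x - 1.0) - lam    # ascent direction for the primal variable

def grad_lam(x, lam):
    return x - 0.5                   # ascent direction for the dual = constraint violation

eta = 0.1
x, lam = 0.0, 0.0
gx_prev, gl_prev = grad_x(x, lam), grad_lam(x, lam)
for _ in range(2000):
    gx, gl = grad_x(x, lam), grad_lam(x, lam)
    # optimistic ("past-gradient") update: evaluate 2*current - previous gradient
    x = x + eta * (2 * gx - gx_prev)
    lam = max(0.0, lam + eta * (2 * gl - gl_prev))   # project dual onto lam >= 0
    gx_prev, gl_prev = gx, gl

print(round(x, 3), round(lam, 3))    # last iterate approaches (0.5, 1.0)
```

The projection `max(0.0, ...)` keeps the multiplier nonnegative; everything else is the plain primal‑dual loop with the look‑ahead gradient substituted in.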

Results & Findings

  • Theoretical guarantees: OPD achieves last‑iterate convergence, unlike standard primal‑dual methods that only guarantee convergence of an average of iterates. This is crucial because practitioners deploy the final model, not an average.
  • Stability: The optimistic step damps the high‑frequency oscillations observed in constrained RL training, leading to smoother loss curves and more predictable constraint satisfaction.
  • Error dependence: In the parameterized setting, the distance to the true optimum is bounded by a term proportional to the policy’s approximation error and any bias from stochastic gradient estimates. This quantifies how model capacity and data quality affect alignment quality.
  • Empirical validation (briefly reported): Experiments on synthetic constrained bandit problems and a small‑scale LLM alignment task show that OPD reaches higher reward and satisfies the safety constraints faster than vanilla primal‑dual or projected‑gradient methods.
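
The last‑iterate point in the first bullet is easy to see on the classic bilinear toy game (our example, not one of the paper's experiments): plain gradient descent‑ascent's last iterate spirals away from the saddle, while the optimistic variant's last iterate converges to it.

```python
import math

# Bilinear toy game: min_x max_y f(x, y) = x * y, saddle point at the origin.
def run(optimistic, steps=500, eta=0.2):
    x, y = 1.0, 1.0
    gx_prev, gy_prev = y, x          # df/dx = y, df/dy = x
    for _ in range(steps):
        gx, gy = y, x
        if optimistic:
            dx, dy = 2 * gx - gx_prev, 2 * gy - gy_prev   # look-ahead gradients
        else:
            dx, dy = gx, gy                               # plain simultaneous GDA
        x, y = x - eta * dx, y + eta * dy    # descend in x, ascend in y
        gx_prev, gy_prev = gx, gy
    return math.hypot(x, y)          # distance of the LAST iterate from the saddle

print(run(optimistic=False))   # plain GDA: last iterate spirals outward
print(run(optimistic=True))    # optimistic: last iterate converges to the saddle
```

Averaging the plain‑GDA iterates would also converge here, but the averaged policy is not the one you deploy; the optimistic update fixes the last iterate itself.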

Practical Implications

  • Deploy‑ready models: Developers can now rely on the final checkpoint of a safe‑RLHF run, reducing the need for post‑hoc averaging or checkpoint selection heuristics.
  • Plug‑in upgrade: Existing RLHF pipelines (e.g., OpenAI’s PPO‑based fine‑tuning, Anthropic’s constitutional AI loops) can incorporate the OPD update rule with minimal code changes, gaining stability without redesigning the reward model.
  • Safety‑first training: The tighter control over constraint violation makes OPD attractive for regulated domains (healthcare, finance, content moderation) where exceeding safety budgets is unacceptable.
  • Resource efficiency: By converging faster and avoiding oscillatory waste, OPD can cut the number of RLHF epochs, saving compute and shrinking the carbon footprint of large‑scale LLM fine‑tuning.
  • Guidance for model selection: The explicit error‑bound term helps engineers decide how much model capacity is needed to meet a desired safety‑reward trade‑off, turning a vague “bigger is better” intuition into a quantitative design rule.
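
As a sketch of the "plug‑in upgrade" point, the skeleton below shows where an optimistic dual update could slot into an existing loop. Everything here is hypothetical scaffolding: `policy_step` stands in for a real pipeline's PPO/GRPO primal update, and its linear cost response is a fake model used only to make the loop runnable.

```python
# Hypothetical skeleton: optimistic dual update wrapped around an RLHF loop.
budget = 0.5       # safety budget b
eta_dual = 0.05    # dual step size
lam, viol_prev = 0.0, 0.0

def policy_step(lam):
    """Stub for one primal update on reward - lam * cost; returns avg safety cost.

    A real pipeline would run a PPO/GRPO step here; we fake a response curve
    in which the cost shrinks as the safety penalty grows.
    """
    return max(0.0, 1.0 - 0.4 * lam)

for epoch in range(500):
    cost = policy_step(lam)
    viol = cost - budget             # estimated constraint violation
    # optimistic dual ascent: extrapolate with the previous violation estimate
    lam = max(0.0, lam + eta_dual * (2 * viol - viol_prev))
    viol_prev = viol

print(round(lam, 2), round(policy_step(lam), 2))  # settle near (1.25, 0.5)
```

The only pipeline change relative to standard Lagrangian RLHF is the `2 * viol - viol_prev` extrapolation, which is what makes this a two‑line code edit rather than a redesign.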

Limitations & Future Work

  • Assumption of convex‑concave structure: The convergence proofs rely on convexity in the policy distribution space, which may not hold for highly non‑convex neural‑network parameterizations.
  • Approximation error dependence: The neighborhood guarantee scales with the policy’s representational error; extremely under‑parameterized models could still violate constraints appreciably.
  • Empirical scope: Experiments are limited to modest‑size models and synthetic tasks; scaling the method to billion‑parameter LLMs remains an open engineering challenge.
  • Extension to multiple constraints: While the framework can handle a single safety cost, handling many interacting constraints (e.g., fairness, toxicity, latency) may require more sophisticated dual dynamics.

Future research directions include: extending OPD to fully non‑convex settings via variance‑reduced or adaptive optimism, integrating it with off‑policy data reuse (e.g., replay buffers), and benchmarking on real‑world LLM alignment suites with multi‑objective safety metrics.

Authors

  • Yining Li
  • Peizhong Ju
  • Ness Shroff

Paper Information

  • arXiv ID: 2602.22146v1
  • Categories: cs.LG, cs.AI
  • Published: February 25, 2026
