[Paper] The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Published: February 17, 2026 at 01:39 PM EST
5 min read
Source: arXiv - 2602.15799v1

Overview

Fine‑tuning a large language model (LLM) that has already been “aligned” for safety can unexpectedly erode those safety guardrails—even when the downstream task is completely benign and the training data contain no harmful content. The paper The Geometry of Alignment Collapse: When Fine‑Tuning Breaks Safety uncovers why the common belief that fine‑tuning updates stay orthogonal to safety‑critical directions is misleading, and it shows that the geometry of the loss landscape itself drives a systematic drift into unsafe regions.

Key Contributions

  • Geometric instability proof – Demonstrates that orthogonality between fine‑tuning gradients and safety directions is structurally unstable under gradient descent dynamics.
  • Alignment Instability Condition (AIC) – Introduces three geometric properties (low‑dimensional safety subspace, sharp curvature, and curvature coupling) that together guarantee safety degradation.
  • Quartic scaling law – Shows that alignment loss grows proportionally to the fourth power of training time, linking the rate of safety decay to curvature metrics of the alignment manifold.
  • Curvature‑aware diagnostic framework – Provides a practical set of tools (e.g., Hessian‑based sharpness estimators) for predicting when a fine‑tuning run will breach safety.
  • Empirical validation – Confirms the theory on several open‑weight LLMs (e.g., LLaMA‑2, Falcon) across diverse benign fine‑tuning tasks (summarization, code generation, Q&A).
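The curvature‑aware diagnostics center on estimating the sharpness (top Hessian eigenvalue) of the safety loss. The following is a minimal numerical sketch, not the paper's released code: `hvp` and `top_eigenvalue` are illustrative names, and the toy quadratic "safety loss" stands in for a real probe so the true eigenvalue is known.

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Hessian-vector product via central finite differences:
    H v ≈ (∇L(θ + εv) − ∇L(θ − εv)) / (2ε)."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def top_eigenvalue(grad_fn, theta, iters=100, seed=0):
    """Power iteration on the Hessian: returns an estimate of its top
    eigenvalue, i.e. a 'sharpness' score of the loss around theta."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(theta.shape)
    v /= np.linalg.norm(v)
    lam = 0.0
    for _ in range(iters):
        hv = hvp(grad_fn, theta, v)
        lam = float(v @ hv)              # Rayleigh quotient vᵀHv (|v| = 1)
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return lam

# Toy quadratic 'safety loss' L(θ) = ½ θᵀ diag(d) θ with known curvature:
# the sharp direction has eigenvalue 150, matching the paper's risk regime.
d = np.array([150.0, 4.0, 0.5])
grad_fn = lambda th: d * th              # ∇L = diag(d) θ
theta = np.array([0.3, -0.2, 0.1])
print(round(top_eigenvalue(grad_fn, theta), 2))  # ≈ 150.0
```

On a real model the same power iteration works with autograd Hessian‑vector products instead of finite differences, so the Hessian never has to be materialized.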

Methodology

  1. Modeling the loss landscape – The authors treat the aligned model’s parameter space as a high‑dimensional manifold where safety constraints occupy a low‑dimensional subspace with unusually high curvature (think of a narrow ridge).

  2. First‑order vs. second‑order dynamics – While the initial gradient step may be orthogonal to the safety subspace, the curvature of the fine‑tuning loss introduces a second‑order acceleration term (via the Hessian) that nudges the trajectory toward the ridge.

  3. Deriving the AIC – By analyzing the eigen‑structure of the Hessian for both the alignment loss and the fine‑tuning loss, they identify three conditions that, when satisfied, guarantee the drift:

    • Low‑dimensional safety manifold
    • Sharp eigenvalues (high curvature) along that manifold
    • Non‑zero coupling between fine‑tuning gradients and the safety Hessian

  4. Theoretical scaling – Using Taylor expansions and stochastic differential equation approximations, they prove that the alignment loss grows as

    $$L_{\text{align}}(t) \sim \kappa\, t^{4},$$

    where $\kappa$ aggregates curvature and coupling constants.

  5. Experimental pipeline – Fine‑tunes several pretrained LLMs on clean datasets (e.g., WikiSumm, CodeParrot) while tracking:

    • Alignment loss (via a held‑out safety probe)
    • Gradient/Hessian spectra
    • Emergence of unsafe generations (prompted by standard red‑team tests)
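The mechanism in steps 1–4 can be reproduced in a two‑parameter toy model. This is an illustrative sketch under assumed constants, not the paper's experiment: `theta[0]` plays the sharp safety direction, `theta[1]` the benign task direction, and a small bilinear coupling `c` makes the first‑order gradient orthogonal to safety at initialization while still feeding task progress back into the safety direction.

```python
import numpy as np

# Toy fine-tuning loss L_f = -θ1 + c·θ0·θ1: its gradient at init (θ = 0)
# is exactly orthogonal to the safety direction θ0, but the coupling c
# produces a second-order drift into it, as in the paper's analysis.
a, c, eta = 150.0, 0.01, 1e-3    # safety curvature, coupling, learning rate
theta = np.zeros(2)
ts, losses = [], []
for t in range(1, 10001):
    g = np.array([c * theta[1], -1.0 + c * theta[0]])  # ∇L_f
    theta -= eta * g
    if t % 500 == 0:
        ts.append(t)
        losses.append(0.5 * a * theta[0] ** 2)  # alignment loss ½·a·θ0²

# Log-log slope of alignment loss vs. training time ≈ 4: the quartic law.
slope = np.polyfit(np.log(ts), np.log(losses), 1)[0]
print(round(slope, 2))  # ≈ 4.0
```

The drift follows directly: θ1 grows linearly with t, so the coupled gradient component c·θ1 integrates to θ0 ∝ t², and the quadratic safety loss in θ0 then scales as t⁴.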

Results & Findings

| Model | Fine‑tuning task | Alignment loss after 10k steps | Unsafe generations (↑) |
| --- | --- | --- | --- |
| LLaMA‑2‑7B | Summarization | 0.12 → 0.48 (×4) | +23 % |
| Falcon‑40B | Code generation | 0.09 → 0.41 (×4.5) | +31 % |
| Mistral‑7B | QA | 0.11 → 0.45 (×4.1) | +27 % |

  • Quadratic vs. quartic growth – Simple linear or quadratic models dramatically under‑predict the observed safety loss; the quartic law fits the empirical curve with $R^{2}>0.96$.
  • Curvature as a predictor – Models with higher top‑eigenvalue of the safety Hessian (> 150) degrade faster, confirming the theoretical link.
  • Coupling matters – When the fine‑tuning loss shares even a small projection onto the safety Hessian (as low as 0.02 rad), the drift accelerates; completely decoupled tasks (synthetic control) show negligible safety loss.
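The gap between quadratic and quartic fits is easy to see even on synthetic data. The sketch below generates noisy quartic alignment‑loss measurements (the data and the κ constant are fabricated for illustration, not taken from the paper) and compares the two power laws by R²:

```python
import numpy as np

# Synthetic alignment-loss curve: L(t) = κ·t⁴ with multiplicative noise.
rng = np.random.default_rng(42)
t = np.linspace(1e3, 1e4, 50)
kappa = 4.8e-17
loss = kappa * t**4 * rng.lognormal(0.0, 0.05, t.size)

def r2_powerlaw(t, y, p):
    """R² of the best least-squares fit y ≈ k·t^p (k fitted in log space)."""
    log_k = np.mean(np.log(y) - p * np.log(t))
    pred = np.exp(log_k) * t**p
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

print(round(r2_powerlaw(t, loss, 4), 3))  # quartic: close to 1
print(round(r2_powerlaw(t, loss, 2), 3))  # quadratic: far worse
```

Because the loss spans four orders of magnitude over the run, a quadratic model cannot track the late‑time acceleration no matter how its constant is chosen, which mirrors the under‑prediction reported above.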

Practical Implications

  1. Safety‑first fine‑tuning pipelines need curvature checks – Before launching a fine‑tuning job, compute a cheap Hessian‑vector product on a safety probe to estimate sharpness; high values flag a high risk of alignment collapse.
  2. Curvature‑aware optimizers – Techniques such as Sharpness‑Aware Minimization (SAM) or second‑order preconditioners can damp the acceleration term, keeping the trajectory away from the unsafe ridge.
  3. Dynamic safety monitoring – Instead of a one‑off red‑team test after fine‑tuning, continuously monitor the alignment loss (or its proxy) during training; early spikes can trigger early‑stop or rollback.
  4. Model‑card updates – Release notes for fine‑tuned models should now include a “curvature profile” alongside traditional metrics (accuracy, FLOPs).
  5. Tooling for developers – The paper’s diagnostic code (open‑sourced) can be wrapped into popular libraries (🤗 Transformers, DeepSpeed) to automatically warn developers when the AIC is likely to be satisfied.
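Point 3 (dynamic safety monitoring) can be wired into a training loop as a simple guard. The sketch below is hypothetical: the `AlignmentGuard` class, its thresholds, and the simulated drift curve are illustrative choices, not an interface from the paper.

```python
# Hypothetical early-stop guard: track a safety-probe loss during
# fine-tuning and trip when it shows a sustained spike over baseline.
class AlignmentGuard:
    def __init__(self, baseline, max_ratio=2.0, window=5):
        self.baseline = baseline      # probe loss measured before fine-tuning
        self.max_ratio = max_ratio    # trip when loss > ratio × baseline
        self.window = window          # require the spike to persist
        self.history = []

    def update(self, probe_loss):
        """Return True if training should stop (sustained safety-loss spike)."""
        self.history.append(probe_loss)
        recent = self.history[-self.window:]
        return (len(recent) == self.window and
                min(recent) > self.max_ratio * self.baseline)

guard = AlignmentGuard(baseline=0.12)
# Simulated quartic-ish drift of the probe loss during fine-tuning steps.
drift = [0.12 + 1e-3 * (t / 10) ** 4 for t in range(60)]
stop_step = next(t for t, L in enumerate(drift) if guard.update(L))
print(stop_step)  # → 38
```

Requiring the spike to persist for `window` steps avoids tripping on a single noisy probe evaluation; on trip, the loop would roll back to the last safe checkpoint.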

Limitations & Future Work

  • Hessian approximation cost – The current analysis relies on full‑batch Hessian eigen‑estimates, which are expensive for the largest LLMs; scalable stochastic approximations are needed.
  • Scope of tasks – Experiments focus on text‑centric tasks; it remains open how the phenomenon translates to multimodal fine‑tuning (e.g., vision‑language models).
  • Mitigation strategies not fully vetted – While curvature‑aware optimizers show promise, systematic benchmarks across diverse downstream applications are still pending.
  • Theoretical assumptions – The quartic scaling law assumes smooth loss surfaces and small learning rates; real‑world training with large batch sizes or adaptive schedulers may deviate.

The authors suggest extending the geometric framework to meta‑learning scenarios, exploring curvature‑regularized pre‑training, and building a public “alignment curvature leaderboard” to benchmark safe fine‑tuning practices.

Authors

  • Max Springer
  • Chung Peng Lee
  • Blossom Metevier
  • Jane Castleman
  • Bohdan Turbal
  • Hayoung Jung
  • Zeyu Shen
  • Aleksandra Korolova

Paper Information

  • arXiv ID: 2602.15799v1
  • Categories: cs.LG, cs.AI
  • Published: February 17, 2026
