[Paper] Influence Malleability in Linearized Attention: Dual Implications of Non-Convergent NTK Dynamics
Source: arXiv - 2603.13085v1
Overview
The paper uncovers a hidden trade-off in linearized attention, a simplified version of the attention layers that power Transformers. Analysing the model through the lens of the Neural Tangent Kernel (NTK), the authors show that, unlike many wide neural networks, linearized attention fails to converge to its infinite-width kernel limit at realistic model sizes. This "non-convergent" behavior makes attention both more expressive (it can better match the structure of a task) and more fragile (it can be steered more easily by a few training examples).
Key Contributions
- Spectral amplification theorem: proves that the condition number of the Gram matrix is cubed by the attention transformation, implying that to reach NTK convergence the width must scale as \(m = \Omega(\kappa^6)\), where \(\kappa\) is the Gram condition number.
- Empirical verification on natural-image datasets that practical widths (e.g., \(m \le 10^4\)) are far below the theoretical threshold, confirming persistent non-convergence.
- Influence malleability metric: quantifies how much a model’s predictions can be altered by re‑weighting individual training points; linearized attention shows 6–9× higher malleability than standard ReLU MLPs.
- Dual‑implication analysis: demonstrates that higher malleability can reduce approximation error (better alignment with task‑specific structure) but also increase vulnerability to adversarial training‑data attacks.
- Kernel interpretation: provides a data-dependent, Gram-induced kernel view of linearized attention, bridging kernel theory and modern attention mechanisms.
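To give a feel for the width threshold in the first two bullets, the arithmetic below plugs a few values into \(\kappa^6\). It is illustrative only: the constant hidden by \(\Omega\) is assumed to be 1, and the values of \(\kappa\) are hypothetical, not taken from the paper.

```latex
% Illustrative arithmetic; hidden constant assumed to be 1, kappa values hypothetical.
\[
\kappa = 10 \;\Rightarrow\; \kappa^{6} = 10^{6} = 100 \times 10^{4},
\qquad
\kappa = 100 \;\Rightarrow\; \kappa^{6} = 10^{12}.
\]
% Conversely, a width of m = 10^4 would only cover condition numbers up to
\[
\kappa \le m^{1/6} = 10^{2/3} \approx 4.64 .
\]
```

Under these assumptions, even a moderately ill-conditioned Gram matrix pushes the required width orders of magnitude beyond \(m \le 10^4\).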
Methodology
Linearized attention formulation – The authors replace the softmax-based attention with a linear map expressed through the Gram matrix \(G = X X^\top\), where \(X\) holds the input embeddings. This yields an exact kernel representation: the network's output equals a kernel regression with a data-dependent kernel
\[ K_{\text{att}}(x, x') = \phi(x)^\top G^{-1} \phi(x'). \]
NTK analysis – Using the Neural Tangent Kernel framework, they derive the infinite‑width limit of the linearized attention network and compare it to the finite‑width dynamics. The key step is a spectral analysis of how the attention operation transforms the eigenvalues of the Gram matrix.
Spectral amplification proof – By bounding the eigenvalue spread before and after the attention step, they show that the condition number is raised to the third power, leading to the width requirement \(m = \Omega(\kappa^6)\).
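The cubing effect can be checked numerically in a toy setting. The sketch below assumes the attention step is the unnormalized map \(X \mapsto (X X^\top) X\) (queries, keys and values all equal to \(X\)); this form is an assumption made for illustration and may differ from the paper's exact definition. Under it, the Gram matrix of the output is \(G^3\), so its condition number is exactly \(\kappa^3\). The helper names are hypothetical.

```python
import numpy as np

def condition_number(sym_matrix):
    """Ratio of largest to smallest eigenvalue of a symmetric positive-definite matrix."""
    eigvals = np.linalg.eigvalsh(sym_matrix)  # eigenvalues in ascending order
    return eigvals[-1] / eigvals[0]

def gram_before_and_after(X):
    """Return the Gram matrix of X and of the assumed attention output (X X^T) X."""
    G = X @ X.T        # Gram matrix of the input embeddings
    Z = G @ X          # ASSUMPTION: unnormalized linear attention with Q = K = V = X
    return G, Z @ Z.T  # Z Z^T = G X X^T G = G^3

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))  # 8 tokens, 16 features, so G has full rank
G, G_after = gram_before_and_after(X)
kappa = condition_number(G)
kappa_after = condition_number(G_after)
print(f"kappa(G) = {kappa:.3f}, after attention = {kappa_after:.3f}, kappa^3 = {kappa**3:.3f}")
```

The identity holds for any full-rank \(X\) in this toy model; how the bound behaves with normalization or learned projections is outside what the sketch covers.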
Influence malleability measurement – They adopt the classic influence function formalism (Koh & Liang, 2017) and compute how much the model’s prediction changes when a single training example’s loss is perturbed. The ratio of the maximal influence to that of a baseline ReLU network defines “malleability.”
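As a rough illustration of the influence-function formalism, the sketch below computes the Koh & Liang up-weighting influence \(-\nabla L(z_{\text{test}})^\top H^{-1} \nabla L(z_i)\) for a ridge-regression model, where the Hessian is available in closed form. The ridge model, the function names, and the `malleability_ratio` helper (ratio of maximal absolute influences) are assumptions chosen for simplicity; the paper's exact metric and models are not reproduced here.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Minimize (1/n) * sum 0.5*(x_i.theta - y_i)^2 + (lam/2)*||theta||^2."""
    n, d = X.shape
    H = X.T @ X / n + lam * np.eye(d)  # Hessian of the regularized objective
    theta = np.linalg.solve(H, X.T @ y / n)
    return theta, H

def influence_on_test_loss(X, y, x_test, y_test, lam=1e-2):
    """Influence of up-weighting each training point on the loss at one test point."""
    theta, H = fit_ridge(X, y, lam)
    train_grads = (X @ theta - y)[:, None] * X      # (n, d) per-example loss gradients
    test_grad = (x_test @ theta - y_test) * x_test  # (d,) test loss gradient
    return -train_grads @ np.linalg.solve(H, test_grad)

def malleability_ratio(infl_model, infl_baseline):
    """Hypothetical summary: maximal absolute influence relative to a baseline."""
    return np.max(np.abs(infl_model)) / np.max(np.abs(infl_baseline))

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(50)
x_test, y_test = rng.standard_normal(5), 0.5
influences = influence_on_test_loss(X, y, x_test, y_test)
print("most influential training index:", int(np.argmax(np.abs(influences))))
```

For deep models the Hessian inverse is usually approximated rather than solved exactly, so this closed-form version is only a small-scale reference point.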
Experiments – Training linearized attention models of varying widths on CIFAR‑10/100 and ImageNet‑mini, they track NTK alignment (via cosine similarity of Jacobians) and malleability, and they run targeted data‑poisoning attacks to illustrate the security risk.
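One way to picture the NTK-alignment measurement is to compare the empirical kernel \(J J^\top\) before and after a parameter change. The sketch below does this for a two-layer ReLU network rather than linearized attention, and uses a random fixed-norm perturbation as a stand-in for a training update; both choices, and all function names, are assumptions made to keep the example short.

```python
import numpy as np

def empirical_ntk(W, a, X):
    """Empirical NTK J J^T of f(x) = a^T relu(W x) / sqrt(m) on the rows of X."""
    m = W.shape[0]
    pre = X @ W.T                     # (n, m) pre-activations
    act = np.maximum(pre, 0.0)        # ReLU outputs
    gate = (pre > 0).astype(X.dtype)  # ReLU derivative
    K_a = act @ act.T / m             # gradient w.r.t. the output weights a
    # Gradient w.r.t. W: <x, x'> * sum_j a_j^2 * gate_j(x) * gate_j(x') / m
    K_W = (X @ X.T) * ((gate * a**2) @ gate.T) / m
    return K_a + K_W

def kernel_alignment(K1, K2):
    """Cosine similarity between two kernel matrices under the Frobenius inner product."""
    return np.sum(K1 * K2) / (np.linalg.norm(K1) * np.linalg.norm(K2))

def alignment_after_perturbation(width, X, step_norm=1.0, seed=0):
    """Alignment between the kernel at init and after a fixed-norm random change to W."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W0 = rng.standard_normal((width, d))
    a = rng.choice([-1.0, 1.0], size=width)
    D = rng.standard_normal((width, d))
    W1 = W0 + step_norm * D / np.linalg.norm(D)  # hypothetical stand-in for training
    return kernel_alignment(empirical_ntk(W0, a, X), empirical_ntk(W1, a, X))

X_demo = np.random.default_rng(1).standard_normal((20, 5))
for width in (16, 256, 4096):
    print(width, alignment_after_perturbation(width, X_demo))
```

In the kernel regime the alignment would typically move toward 1 as the width grows; the paper's claim is that linearized attention stays visibly below that at practical widths.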
Results & Findings
| Aspect | Observation |
|---|---|
| NTK convergence | Even at widths of 8K–16K (far larger than typical Transformer heads), the finite-width dynamics diverge noticeably from the NTK prediction. |
| Spectral amplification | Empirically, the condition number of the Gram matrix after attention is ≈ \(\kappa^3\), matching the theoretical bound. |
| Malleability | Linearized attention's influence scores are 6–9× higher than those of a comparable ReLU MLP, confirming a stronger dependence on individual training points. |
| Approximation error | On clean test sets, the higher malleability translates into 2–4% lower error compared with a kernel-only baseline, showing a practical benefit. |
| Adversarial susceptibility | Simple data-poisoning (flipping labels of < 1% of training images) can cause > 15% degradation in test accuracy, whereas the ReLU baseline degrades by < 5%. |
In short, the paper validates that linearized attention lives in a regime where kernel‑theoretic guarantees no longer hold, and that this regime both empowers and endangers the model.
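The label-flipping setup in the adversarial-susceptibility row can be sketched in a few lines. The version below flips a random fraction of labels to a different class; it is an untargeted simplification, and the function name and the synthetic labels are hypothetical rather than the paper's attack code.

```python
import numpy as np

def flip_labels(labels, fraction, num_classes, rng):
    """Return a copy of `labels` with a random `fraction` changed to another class."""
    labels = np.asarray(labels)
    poisoned = labels.copy()
    n_flip = int(round(fraction * len(labels)))
    idx = rng.choice(len(labels), size=n_flip, replace=False)
    # A non-zero offset modulo num_classes guarantees each chosen label changes.
    offsets = rng.integers(1, num_classes, size=n_flip)
    poisoned[idx] = (labels[idx] + offsets) % num_classes
    return poisoned, idx

rng = np.random.default_rng(0)
clean = rng.integers(0, 10, size=1000)  # synthetic 10-class labels
poisoned, flipped_idx = flip_labels(clean, fraction=0.01, num_classes=10, rng=rng)
print("labels changed:", int((poisoned != clean).sum()))
```

Training one model on `clean` and one on `poisoned`, then comparing test accuracy, gives the kind of degradation measurement reported in the table.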
Practical Implications
- Designing more robust Transformers – Adding regularization (e.g., spectral norm constraints on the Gram matrix) could temper malleability without sacrificing too much expressivity.
- Data‑centric debugging – Because a handful of training examples can swing predictions, developers should invest in influence‑based tooling (e.g., fast influence estimators) to spot “high‑impact” samples that might be noisy or malicious.
- Fine‑tuning strategies – When fine‑tuning large language or vision models, practitioners might prefer smaller attention heads or low‑rank approximations to keep the model closer to the kernel regime, reducing over‑sensitivity to limited fine‑tuning data.
- Adversarial training & data sanitization – The findings motivate data‑poisoning defenses (e.g., robust loss functions, gradient clipping) specifically targeted at attention layers, which are currently under‑protected compared to feed‑forward parts of the network.
- Kernel‑inspired initialization – Because the linearized attention kernel can be computed analytically, one could initialize a full Transformer with a kernel‑aligned weight distribution, potentially speeding up convergence and improving stability in early training epochs.
Overall, the work gives developers a concrete diagnostic (malleability) and a theoretical “warning sign” (spectral amplification) to watch for when building or deploying attention‑heavy models.
Limitations & Future Work
- Linearized vs. full softmax attention – The study focuses on a linearized variant; extending the spectral analysis to the standard softmax‑based attention remains an open challenge.
- Synthetic condition numbers – The condition number \(\kappa\) is measured on the Gram matrix of the raw embeddings; real-world pipelines often include normalization, positional encodings, or learned projections that could alter the amplification effect.
- Scale of experiments – Experiments are limited to image classification benchmarks up to ImageNet‑mini; confirming the phenomena on massive language corpora (e.g., GPT‑style models) is left for future work.
- Mitigation strategies – While the paper hints at regularization and robust training, it does not provide a systematic recipe; subsequent research could develop practical algorithms that balance malleability and robustness.
Addressing these gaps will help translate theoretical insight into concrete engineering guidelines for safer, more reliable attention mechanisms.
Authors
- Jose Marie Antonio Miñoza
- Paulo Mario P. Medina
- Sebastian C. Ibañez
Paper Information
- arXiv ID: 2603.13085v1
- Categories: cs.LG, cs.CV, math.NA, stat.ML
- Published: March 13, 2026