[Paper] Vision Transformer Finetuning Benefits from Non-Smooth Components
Source: arXiv - 2602.06883v1
Overview
The paper investigates why Vision Transformers (ViTs) sometimes fine‑tune better than expected, challenging the common belief that smoother (i.e., less sensitive) models are always preferable. By introducing the notion of plasticity—the average rate at which a component’s output changes in response to input perturbations—the authors show that the less smooth parts of a ViT (attention heads and feed‑forward layers) are actually the most valuable when adapting a pre‑trained model to a new task.
Key Contributions
- Plasticity Metric: Proposes a simple, theoretically‑grounded measure of a layer’s sensitivity to input changes, complementing existing smoothness analyses.
- Theoretical Insight: Demonstrates analytically that high plasticity correlates with a larger capacity to adjust representations during transfer learning.
- Empirical Validation: Runs extensive fine‑tuning experiments on multiple vision benchmarks (ImageNet‑A, CIFAR‑10/100, VTAB) showing that prioritizing high‑plasticity components yields consistent performance gains.
- Practical Guidance: Provides a concrete recipe—freeze low‑plasticity layers (e.g., early embedding layers) and fine‑tune high‑plasticity ones (attention + feed‑forward) for better sample efficiency.
- Open‑Source Code: Releases a lightweight toolbox (vit‑plasticity) for measuring plasticity and reproducing the experiments.
Methodology
- Defining Plasticity – For a given module \(f(\cdot)\), plasticity is computed as the average Jacobian norm \(\mathbb{E}_{x}\,\lVert \nabla_x f(x) \rVert\). Intuitively, it quantifies how much the module’s output “wiggles” when the input is nudged.
- Layer‑wise Analysis – The authors calculate plasticity for each ViT block (embedding, multi‑head attention, feed‑forward) on a held‑out validation set of the source task.
- Fine‑tuning Protocols – They compare three strategies across several downstream datasets:
- Uniform: fine‑tune all layers.
- Low‑Plasticity Freeze: freeze layers with the lowest plasticity scores.
- High‑Plasticity Focus: only fine‑tune the top‑plasticity layers (attention + feed‑forward).
- Metrics – Standard top‑1 accuracy, calibration error, and training stability (gradient norm variance) are reported.
- Ablation Studies – Vary the number of frozen layers, test different ViT sizes (ViT‑B/16, ViT‑L/32), and compare against alternative smoothness‑based heuristics.
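The plasticity definition above is straightforward to estimate numerically. Below is a minimal sketch (not the paper's released code) that approximates the average Jacobian Frobenius norm of a module with central finite differences; the function name `plasticity` and the toy linear module are our own illustration, chosen because a linear map has a constant Jacobian, making the estimate easy to sanity-check.

```python
import numpy as np

def plasticity(f, xs, eps=1e-5):
    """Estimate E_x ||J_f(x)||_F, the average Jacobian Frobenius norm,
    via central finite differences.

    f  : module mapping R^d -> R^k
    xs : (n, d) array of sample inputs drawn from the data distribution
    """
    norms = []
    for x in xs:
        cols = []
        for i in range(x.size):
            e = np.zeros(x.size)
            e[i] = eps
            # i-th Jacobian column: df/dx_i at x
            cols.append((f(x + e) - f(x - e)) / (2 * eps))
        J = np.stack(cols, axis=1)        # (k, d) Jacobian at x
        norms.append(np.linalg.norm(J))   # Frobenius norm
    return float(np.mean(norms))

# Sanity check: for a linear module f(x) = W @ x the Jacobian is W
# everywhere, so the estimate should equal ||W||_F for any inputs.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 4))
xs = rng.standard_normal((8, 4))
score = plasticity(lambda x: W @ x, xs)
```

In practice one would replace the finite-difference loop with automatic differentiation and evaluate each ViT block on held-out source-task data, as the paper's layer-wise analysis does.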
Results & Findings
- Higher Accuracy: Across all downstream tasks, the High‑Plasticity Focus strategy matches or exceeds full‑model fine‑tuning, with gains of up to 0.5–2.3 % absolute accuracy, while using 30–50 % fewer trainable parameters.
- Faster Convergence: Models that only update high‑plasticity modules converge in roughly half the epochs needed for full fine‑tuning.
- Robustness: Plasticity‑guided fine‑tuning yields lower calibration error and is less prone to catastrophic forgetting of the source task.
- Layer Ranking Consistency: Attention and feed‑forward layers consistently rank among the top‑3 in plasticity, regardless of ViT depth or pre‑training dataset.
- Theoretical Alignment: Empirical trends align with the derived bound showing that higher Jacobian norms increase the capacity to reshape feature manifolds during transfer.
Practical Implications
- Efficient Transfer Learning: Developers can dramatically cut GPU memory and training time by freezing low‑plasticity layers (often the early patch embedding and positional encoding) and only updating attention/feed‑forward blocks.
- Resource‑Constrained Scenarios: Edge‑AI pipelines that need to adapt a large pre‑trained ViT on‑device can now do so with a fraction of the compute budget.
- Model Compression & Pruning: Plasticity scores can guide which weights to prune or quantize without harming fine‑tuning performance.
- Automated Fine‑tuning Tools: The released vit‑plasticity library can be integrated into MLOps pipelines to automatically select the optimal fine‑tuning schedule per downstream task.
- Beyond Vision: The plasticity concept is architecture‑agnostic, suggesting similar strategies could improve transfer for language transformers, multimodal models, or even graph neural networks.
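The recipe implied above — rank modules by plasticity, fine-tune the top scorers, freeze the rest — reduces to a few lines once scores are in hand. The sketch below is a hypothetical helper (the function name, score values, and module names are ours, not the paper's API) that turns a score table into a freeze/tune plan:

```python
def finetune_plan(scores, keep_top=2):
    """Given per-module plasticity scores, pick the top-k modules to
    fine-tune and freeze the remainder.

    scores   : dict mapping module name -> plasticity score
    keep_top : number of highest-plasticity modules to leave trainable
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {"tune": ranked[:keep_top], "freeze": ranked[keep_top:]}

# Illustrative scores: attention and MLP blocks typically rank highest,
# while patch and positional embeddings rank lowest (values invented).
scores = {"patch_embed": 0.4, "attn": 3.1, "mlp": 2.7, "pos_embed": 0.2}
plan = finetune_plan(scores)
# plan["tune"] -> ["attn", "mlp"]; plan["freeze"] -> the two embeddings
```

In a PyTorch-style pipeline, the returned plan would translate into setting `requires_grad = False` on the parameters of every module in the freeze list before building the optimizer.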
Limitations & Future Work
- Scope of Architectures: Experiments focus on vanilla ViT variants; it remains unclear how plasticity behaves in hybrid models (e.g., Swin‑Transformer, Conv‑ViT).
- Dataset Diversity: While several benchmarks were used, the study does not cover extreme domain shifts (e.g., medical imaging) where low‑plasticity layers might still carry crucial domain‑specific priors.
- Static Plasticity Measurement: Plasticity is measured on the source task only; dynamic re‑evaluation during fine‑tuning could further refine which layers to unfreeze.
- Theoretical Bounds: The current analysis provides a high‑level bound; tighter, task‑specific guarantees are an open research direction.
Bottom line: By flipping the smoothness narrative on its head, this work equips practitioners with a data‑driven rule of thumb—focus on the “wiggly” parts of Vision Transformers—to achieve faster, cheaper, and often more accurate fine‑tuning.
Authors
- Ambroise Odonnat
- Laetitia Chapel
- Romain Tavenard
- Ievgen Redko
Paper Information
- arXiv ID: 2602.06883v1
- Categories: cs.LG, cs.CV, stat.ML
- Published: February 6, 2026