[Paper] Fine-Tuning Regimes Define Distinct Continual Learning Problems

Published: April 23, 2026 at 01:59 PM EDT

Source: arXiv - 2604.21927v1

Overview

Continual learning (CL) aims to let neural networks pick up new tasks one after another without catastrophically forgetting what they already know. This paper shows that how much of the model you allow to be fine‑tuned—the “trainable depth” or subspace of parameters you update—dramatically reshapes the learning dynamics and can flip the ranking of popular CL algorithms. In other words, the evaluation landscape itself is a hidden variable that researchers need to treat explicitly.

Key Contributions

  • Formalization of fine‑tuning regimes as projected optimization over fixed trainable subspaces, linking trainable depth to the effective update signal.
  • Empirical study across five depth regimes (from updating only the classifier head to fine‑tuning the whole network) for four widely used CL methods: online EWC, LwF, SI, and GEM.
  • Comprehensive benchmark covering five image datasets (MNIST, Fashion‑MNIST, KMNIST, QMNIST, CIFAR‑100) and 11 random task orders per dataset.
  • Discovery that method rankings are regime‑dependent: an algorithm that excels when only the head is trainable may fall behind others when deeper layers are updated.
  • Analysis of forgetting vs. update magnitude, revealing that deeper adaptation leads to larger weight changes, higher forgetting, and a tighter correlation between the two.
  • Call for regime‑aware evaluation protocols, positioning trainable depth as an explicit experimental factor in CL research.
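The projected-optimization view in the first bullet can be written compactly; the notation below is an illustrative reconstruction, not taken verbatim from the paper. With trainable subspace S and orthogonal projector P_S onto it:

```latex
\theta_{t+1} \;=\; \theta_t \;-\; \eta\, P_S\, \nabla_\theta \mathcal{L}_t(\theta_t),
\qquad P_S^2 = P_S = P_S^\top ,
```

so the effective update signal is only the component of the gradient lying inside S. In the layer-freezing case, P_S is simply a 0/1 diagonal mask that selects the coordinates of the trainable layers.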

Methodology

  1. Define trainable depth regimes – the authors fix a set of layers that remain trainable while all others are frozen. Five regimes range from “head‑only” (just the final linear layer) to “full‑network” fine‑tuning.
  2. Projected gradient descent – during training, gradients are projected onto the subspace spanned by the selected trainable parameters, ensuring that only those weights receive updates.
  3. Continual learning setup – task‑incremental CL is used: a sequence of classification tasks is presented, and after each task the model must retain performance on all previous tasks.
  4. Algorithms evaluated – four representative CL strategies are run under each regime:
    • Online Elastic Weight Consolidation (EWC) – regularizes changes to important weights.
    • Learning without Forgetting (LwF) – uses knowledge distillation to preserve prior behavior.
    • Synaptic Intelligence (SI) – accumulates an importance measure per weight.
    • Gradient Episodic Memory (GEM) – stores a small replay buffer and enforces gradient constraints.
  5. Metrics – average accuracy across tasks, forgetting measure (drop in performance on earlier tasks), and the norm of weight updates are recorded.
  6. Statistical robustness – each dataset is evaluated over 11 random task orderings, and results are aggregated to mitigate order bias.
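The projected-gradient mechanism in step 2 amounts to a per-layer binary mask applied to gradients before each update. A minimal, framework-agnostic sketch (layer names and the regime-to-layers mapping below are illustrative assumptions, not the paper's exact configuration):

```python
# Sketch of projected gradient descent via per-layer gradient masking.
# Layer names and the regime -> trainable-layers mapping are assumptions.

LAYERS = ["conv1", "conv2", "conv3", "fc", "head"]

# Each regime fixes which layers stay trainable; all others are frozen.
REGIMES = {
    "head_only":    {"head"},
    "shallow":      {"fc", "head"},
    "mid":          {"conv3", "fc", "head"},
    "deep":         {"conv2", "conv3", "fc", "head"},
    "full_network": set(LAYERS),
}

def project_gradients(grads, regime):
    """Zero out gradients of frozen layers (projection onto the subspace)."""
    trainable = REGIMES[regime]
    return {name: g if name in trainable else 0.0 for name, g in grads.items()}

def sgd_step(params, grads, lr=0.1):
    """Plain SGD; frozen layers receive a zero gradient and stay put."""
    return {name: params[name] - lr * grads[name] for name in params}

params = {name: 1.0 for name in LAYERS}   # toy scalar "weights"
grads = {name: 0.5 for name in LAYERS}    # toy gradients
masked = project_gradients(grads, "head_only")
params = sgd_step(params, masked)
# Only the head moves (1.0 - 0.1 * 0.5); every frozen layer stays at 1.0.
```

In a real framework the same effect is obtained by freezing parameters or masking their gradients, but the projection logic is identical.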

Results & Findings

Best‑performing CL method (average accuracy) by trainable depth:

  • Head‑only – LwF (≈ 92 % on the MNIST family)
  • Shallow layers – SI (≈ 88 % on CIFAR‑100)
  • Mid‑depth – GEM (≈ 84 % on QMNIST)
  • Deep layers – Online EWC (≈ 78 % on CIFAR‑100)
  • Full‑network – no clear winner; rankings shuffle
  • Ranking instability: The relative order of the four methods changes in almost every depth regime; no single algorithm dominates across all regimes.
  • Update magnitude grows with depth: When more layers are trainable, the L2 norm of weight updates roughly doubles, indicating a stronger learning signal but also more aggressive drift from previously learned representations.
  • Forgetting correlates with update size: Pearson correlation between update magnitude and forgetting rises from ~0.3 (head‑only) to ~0.7 (full‑network), confirming that deeper fine‑tuning amplifies catastrophic forgetting.
  • Dataset dependence: Simpler grayscale datasets (MNIST variants) are less sensitive to depth changes than the more complex CIFAR‑100, where deeper regimes cause pronounced performance drops.
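The two quantities behind these findings, the forgetting measure and its Pearson correlation with update magnitude, can be computed from a per-task accuracy matrix. A small self-contained sketch with toy numbers (not the paper's results):

```python
import math

def forgetting(acc_matrix):
    """Average forgetting: for each earlier task, the drop from its best
    accuracy seen before the final task to its accuracy after the final task.
    acc_matrix[i][j] = accuracy on task j after training on task i."""
    T = len(acc_matrix)
    drops = []
    for j in range(T - 1):
        best = max(acc_matrix[i][j] for i in range(T - 1))
        drops.append(best - acc_matrix[T - 1][j])
    return sum(drops) / len(drops)

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy 3-task run: row i = accuracies after training task i, column j = task j.
acc = [
    [0.95, 0.00, 0.00],
    [0.85, 0.93, 0.00],
    [0.70, 0.80, 0.91],
]
print(round(forgetting(acc), 3))  # (0.95-0.70 + 0.93-0.80) / 2 = 0.19
```

Correlating per-run forgetting values against per-run update norms with `pearson` reproduces the kind of depth-dependent trend the paper reports (~0.3 head-only vs. ~0.7 full-network).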

Practical Implications

  • Model deployment pipelines: When integrating CL into production (e.g., edge devices that receive periodic updates), engineers must decide which layers to expose for online adaptation. Limiting updates to higher layers can preserve prior knowledge better, at the cost of slower adaptation.
  • Hyper‑parameter tuning: Fine‑tuning depth should be treated as a hyper‑parameter alongside learning rate, replay buffer size, or regularization strength. Automated ML (AutoML) tools could incorporate depth selection as part of the search space.
  • Benchmark design: Public CL benchmarks (e.g., ContinualAI’s CLBench) may need to publish results across multiple depth regimes, preventing “over‑fitting” to a single fine‑tuning setup.
  • Tooling for projected optimization: The projected gradient approach is straightforward to implement in PyTorch or TensorFlow (mask gradients with a binary mask per layer). This enables rapid experimentation with custom depth regimes.
  • Edge‑AI and privacy‑preserving updates: In scenarios where only a small subset of model parameters can be transmitted (bandwidth or privacy constraints), the findings guide which subset yields the best trade‑off between learning new tasks and retaining old ones.
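The "decide which layers to expose" deployment pattern boils down to the freeze-by-flag idiom that PyTorch (`requires_grad`) and TensorFlow (`trainable`) provide. A framework-agnostic sketch, with `Param` as a hypothetical stand-in class:

```python
# Freeze-by-flag sketch: Param is a stand-in for a framework parameter object.

class Param:
    def __init__(self, name, value):
        self.name, self.value, self.trainable = name, value, True

def set_trainable_depth(params, trainable_names):
    """Expose only the named layers for online adaptation; freeze the rest.
    Returns the names of the layers left trainable."""
    for p in params:
        p.trainable = p.name in trainable_names
    return [p.name for p in params if p.trainable]

model = [Param(n, 0.0) for n in ("backbone", "neck", "head")]
exposed = set_trainable_depth(model, {"head"})
print(exposed)  # ['head']
```

The same list of exposed names doubles as the parameter subset to transmit in the bandwidth- or privacy-constrained edge setting mentioned above.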

Limitations & Future Work

  • Scope of algorithms: Only four CL methods were examined; newer replay‑based or meta‑learning approaches might behave differently under depth variation.
  • Architectural diversity: Experiments used standard CNNs; transformer‑based vision models or recurrent networks could exhibit distinct depth‑sensitivity patterns.
  • Task types: The study focuses on image classification with task‑incremental splits. Continual reinforcement learning or language modeling tasks may introduce additional dynamics.
  • Static depth regimes: The paper treats trainable depth as a fixed choice per experiment. Future work could explore dynamic depth scheduling, where the model gradually unfreezes deeper layers as it gains confidence.
  • Theoretical bounds: While the authors provide empirical correlations, a formal analysis linking subspace dimensionality to forgetting bounds remains an open research direction.

Authors

  • Paul‑Tiberiu Iordache
  • Elena Burceanu

Paper Information

  • arXiv ID: 2604.21927v1
  • Categories: cs.LG
  • Published: April 23, 2026
