[Paper] Degradation of Feature Space in Continual Learning

Published: February 6, 2026
Source: arXiv:2602.06586v1

Overview

The paper Degradation of Feature Space in Continual Learning challenges a common assumption from classic deep‑learning pipelines: that forcing feature representations to be isotropic (i.e., equally spread in all directions) always improves model robustness. By probing this idea in the context of continual learning—where data arrives as a non‑stationary stream—the authors discover that isotropy can actually hurt performance, revealing a fundamental geometric mismatch between centralized and incremental training regimes.
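
Formally, a representation is isotropic when the covariance of its centered features is (approximately) proportional to the identity, Cov(z) ≈ σ²I, so every direction in feature space carries roughly equal variance; anisotropy means a few directions dominate the variance.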

Key Contributions

  • Empirical investigation of feature‑space isotropy in continual learning, a setting where most prior work focuses on plasticity‑stability trade‑offs but not on representation geometry.
  • Contrastive continual‑learning experiments on CIFAR‑10 and CIFAR‑100 that compare vanilla continual learning, isotropy‑regularized variants, and baseline centralized training.
  • Evidence that isotropy regularization degrades accuracy in streaming scenarios, contrary to its proven benefits in static, centralized training.
  • Insightful analysis of anisotropy emergence as an intrinsic by‑product of incremental updates, suggesting that anisotropic features may be a useful inductive bias for non‑stationary data.
  • Guidelines for future algorithm design, warning researchers against blindly transplanting centralized‑training tricks into continual‑learning pipelines.

Methodology

  1. Continual‑learning backbone – The authors adopt a standard class‑incremental learning protocol: a ResNet‑18 model is trained on a sequence of tasks derived from CIFAR‑10/100, each task introducing new classes while retaining access only to the current task’s data.

  2. Contrastive regularization – To encourage isotropy, they augment the loss with a feature‑space isotropy term that penalizes deviations from a spherical covariance matrix (similar to the “whitening” or “uniformity” losses used in contrastive learning); a minimal sketch of such a penalty appears after this list.

  3. Baselines – Three setups are compared:
    a. vanilla continual learning (no isotropy term)
    b. isotropy‑regularized continual learning
    c. a centralized model trained on the full dataset (the gold‑standard for isotropy benefits).

  4. Metrics – Accuracy after each incremental step, as well as geometric diagnostics (eigenvalue spread of the feature covariance, cosine similarity distribution) are recorded to quantify isotropy vs. anisotropy.

  5. Ablation – The strength of the isotropy regularizer is swept across several values to ensure the observed effect isn’t a hyper‑parameter artifact.
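
The paper's exact loss is not reproduced here, but a minimal PyTorch sketch of a covariance-based isotropy penalty of this kind, together with the eigenvalue-spread diagnostic from step 4, could look as follows. The function names, the squared-Frobenius form of the penalty, and the usage line are illustrative assumptions, not the authors' implementation:

```python
import torch


def isotropy_penalty(features: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the batch feature covariance from a scaled
    identity (spherical covariance). Illustrative sketch only; the
    paper's exact formulation may differ.

    features: (batch_size, dim) embedding matrix.
    """
    z = features - features.mean(dim=0, keepdim=True)   # center the batch
    cov = z.T @ z / (z.shape[0] - 1)                    # (dim, dim) sample covariance
    scale = cov.diagonal().mean()                       # target per-direction variance
    identity = torch.eye(cov.shape[0], device=cov.device, dtype=cov.dtype)
    return ((cov - scale * identity) ** 2).sum()        # squared Frobenius distance


def eigenvalue_spread(features: torch.Tensor) -> torch.Tensor:
    """Geometric diagnostic from step 4: ratio of largest to smallest
    covariance eigenvalue (near 1 = isotropic, large = anisotropic)."""
    z = features - features.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov)                # ascending; cov is symmetric
    return eigvals[-1] / eigvals[0].clamp_min(1e-12)


# Hypothetical usage inside a training step, with lambda_iso the swept strength:
#   loss = contrastive_loss + lambda_iso * isotropy_penalty(embeddings)
```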

Results & Findings

Setting                    Final Accuracy (CIFAR-10)   Final Accuracy (CIFAR-100)   Feature-Covariance Eigen-Spread
Centralized (full data)    92.1 %                      71.4 %                       Near-uniform (low spread)
Vanilla Continual          78.3 %                      48.9 %                       Moderate anisotropy
Isotropy-Regularized       76.5 % (↓ 1.8)              46.2 % (↓ 2.7)               Forced uniformity (high regularizer loss)

  • Accuracy drops when isotropy is enforced, even though the regularizer successfully flattens the eigenvalue distribution.
  • Anisotropic features naturally emerge as the model adapts to new tasks; this anisotropy correlates with better retention of earlier knowledge.
  • Contrastive loss alone (without isotropy) improves representation quality, but the added isotropy term negates those gains.

Takeaway: Making the feature space “spherical” harms the delicate balance between learning new information (plasticity) and preserving old knowledge (stability) in a streaming environment.

Practical Implications

  • Avoid copying centralized tricks – Techniques such as batch‑norm whitening, uniformity losses, or explicit isotropy regularizers that are common in static training can be counter‑productive for on‑device or edge continual‑learning systems.
  • Design anisotropy‑aware architectures – When building models for lifelong learning (e.g., robotics, personalized assistants, autonomous vehicles), prefer regularizers that preserve or adapt the natural anisotropy rather than suppress it.
  • Monitor feature geometry – Simple diagnostics such as eigenvalue spread or cosine‑similarity histograms can be added to training pipelines to flag when a model becomes overly isotropic, providing an early warning for potential forgetting; a sketch of such a check follows this list.
  • Consider resource constraints – Isotropy regularization adds extra computation (covariance estimation, additional loss terms) without clear benefit; omitting it can save memory and FLOPs on embedded devices.
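
As a concrete example of the monitoring suggested above, a lightweight check along these lines could be dropped into an evaluation loop. The function name and the spread_warn threshold are arbitrary illustrative choices, not values from the paper:

```python
import torch
import torch.nn.functional as F


def feature_geometry_report(features: torch.Tensor,
                            spread_warn: float = 2.0) -> dict:
    """Summarize the geometry of a batch of embeddings.

    Reports the covariance eigenvalue spread and the mean pairwise cosine
    similarity, and flags batches that look overly isotropic. The
    spread_warn threshold is an illustrative placeholder.
    """
    z = features - features.mean(dim=0, keepdim=True)
    cov = z.T @ z / (z.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov)
    spread = (eigvals[-1] / eigvals[0].clamp_min(1e-12)).item()

    unit = F.normalize(features, dim=1)                 # unit-norm rows
    cos = unit @ unit.T                                 # pairwise cosine similarities
    mask = ~torch.eye(cos.shape[0], dtype=torch.bool, device=cos.device)
    return {
        "eigen_spread": spread,
        "mean_cosine": cos[mask].mean().item(),
        "too_isotropic": spread < spread_warn,          # possible forgetting warning
    }
```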

Takeaway: The work encourages the community to adopt geometry‑conscious continual‑learning designs, treating the shape of the representation space as a first‑class hyper‑parameter.

Limitations & Future Work

Limitations

  • Dataset scope – Experiments are confined to CIFAR‑10/100; larger‑scale or domain‑specific streams (e.g., video, language) may exhibit different dynamics.
  • Single backbone – Only ResNet‑18 is evaluated; other architectures (Vision Transformers, recurrent nets) could respond differently to isotropy constraints.
  • Regularizer formulation – The study employs a straightforward isotropy penalty; more sophisticated approaches (e.g., task‑aware covariance shaping) might mitigate the observed degradation.
  • Theoretical grounding – Empirical evidence is strong, yet a formal analysis linking anisotropy to the plasticity‑stability trade‑off remains an open research direction.

Future Work

  • Develop adaptive regularizers that learn the optimal degree of isotropy for each task.
  • Investigate how anisotropic feature spaces interact with replay‑based and parameter‑isolation continual‑learning strategies.
  • Extend experiments to diverse datasets and model families to validate the generality of the findings.

Authors

  • Eduard Angelats
  • Paolo Dini
  • Chiara Lanza
  • Marco Miozzo
  • Roberto Pereira

Paper Information

Item         Details
arXiv ID     2602.06586v1
Categories   cs.LG, cs.DC
Published    February 6, 2026