[Paper] Degradation of Feature Space in Continual Learning
Source: arXiv:2602.06586v1
Overview
The paper Degradation of Feature Space in Continual Learning challenges a common assumption from classic deep‑learning pipelines: that forcing feature representations to be isotropic (i.e., equally spread in all directions) always improves model robustness. By probing this idea in the context of continual learning—where data arrives as a non‑stationary stream—the authors discover that isotropy can actually hurt performance, revealing a fundamental geometric mismatch between centralized and incremental training regimes.
Key Contributions
- Empirical investigation of feature‑space isotropy in continual learning, a setting where most prior work focuses on plasticity‑stability trade‑offs but not on representation geometry.
- Contrastive continual‑learning experiments on CIFAR‑10 and CIFAR‑100 that compare vanilla continual learning, isotropy‑regularized variants, and baseline centralized training.
- Evidence that isotropy regularization degrades accuracy in streaming scenarios, contrary to its proven benefits in static, centralized training.
- Analysis showing that anisotropy emerges as an intrinsic by‑product of incremental updates, suggesting that anisotropic features may be a useful inductive bias for non‑stationary data.
- Guidelines for future algorithm design, warning researchers against blindly transplanting centralized‑training tricks into continual‑learning pipelines.
Methodology
Continual‑learning backbone – The authors adopt a standard class‑incremental learning protocol: a ResNet‑18 model is trained on a sequence of tasks derived from CIFAR‑10/100, each task introducing new classes while retaining access only to the current task’s data.
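For concreteness, a minimal sketch of such a class‑incremental split is shown below. The 5‑task, 2‑classes‑per‑task partition of CIFAR‑10 and the `make_task_stream` helper are illustrative assumptions, not details taken from the paper.

```python
# Minimal class-incremental split of CIFAR-10 (assumption: 5 tasks of
# 2 classes each; the paper's exact ordering and split may differ).
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

def make_task_stream(root="./data", num_classes=10, classes_per_task=2):
    """Partition CIFAR-10 into disjoint tasks of consecutive classes."""
    data = datasets.CIFAR10(root=root, train=True, download=True,
                            transform=transforms.ToTensor())
    targets = torch.tensor(data.targets)
    tasks = []
    for start in range(0, num_classes, classes_per_task):
        wanted = torch.arange(start, start + classes_per_task)
        # During task t, only these indices are visible to the learner.
        idx = torch.isin(targets, wanted).nonzero(as_tuple=True)[0]
        tasks.append(Subset(data, idx.tolist()))
    return tasks
```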
Isotropy regularization – To encourage isotropy, the authors augment the contrastive loss with a feature‑space term that penalizes deviations from a spherical covariance matrix (similar to the whitening or uniformity losses used in contrastive learning).
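As a rough illustration of such a term, the PyTorch sketch below penalizes the squared Frobenius distance between the batch covariance and a scaled identity. This is one plausible formulation consistent with the description above, not the paper's exact loss; the `isotropy_penalty` name and the weight `lam` are assumptions.

```python
import torch

def isotropy_penalty(z: torch.Tensor) -> torch.Tensor:
    """Penalize deviation of the batch feature covariance from a scaled identity.

    z: (batch, dim) feature matrix. Returns ||Cov(z) - (tr(Cov)/d) * I||_F^2.
    """
    zc = z - z.mean(dim=0, keepdim=True)          # center the features
    cov = zc.T @ zc / (zc.shape[0] - 1)           # empirical covariance
    d = cov.shape[0]
    target = torch.eye(d, device=z.device) * (torch.trace(cov) / d)
    return ((cov - target) ** 2).sum()            # squared Frobenius norm

# Hypothetical usage inside a training step (lam is the swept strength):
# loss = contrastive_loss + lam * isotropy_penalty(features)
```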
Baselines – Three setups are compared:
a. vanilla continual learning (no isotropy term)
b. isotropy‑regularized continual learning
c. a centralized model trained on the full dataset (the gold‑standard setting for isotropy benefits)
Metrics – Accuracy after each incremental step is recorded, along with geometric diagnostics (eigenvalue spread of the feature covariance, cosine‑similarity distribution) that quantify isotropy vs. anisotropy.
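These diagnostics can be computed along the lines of the sketch below; this is an assumed implementation matching the metrics named above (the paper does not publish this code). A low eigenvalue spread and a tight cosine‑similarity distribution both indicate a more isotropic feature space.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def feature_geometry(z: torch.Tensor) -> dict:
    """Geometric diagnostics for a (batch, dim) feature matrix."""
    zc = z - z.mean(dim=0, keepdim=True)
    cov = zc.T @ zc / (zc.shape[0] - 1)
    eigvals = torch.linalg.eigvalsh(cov)                 # ascending order
    spread = eigvals[-1] / eigvals[0].clamp(min=1e-12)   # condition-number-like ratio
    zn = F.normalize(z, dim=1)
    sims = zn @ zn.T
    # Keep only off-diagonal pairwise cosine similarities.
    off_diag = sims[~torch.eye(len(z), dtype=torch.bool, device=z.device)]
    return {"eigen_spread": spread.item(),
            "cos_mean": off_diag.mean().item(),
            "cos_std": off_diag.std().item()}
```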
Ablation – The strength of the isotropy regularizer is swept across several values to ensure the observed effect isn’t a hyper‑parameter artifact.
Results & Findings
| Setting | Final Accuracy (CIFAR‑10) | Final Accuracy (CIFAR‑100) | Feature‑Covariance Eigen‑Spread |
|---|---|---|---|
| Centralized (full data) | 92.1 % | 71.4 % | Near‑uniform (low spread) |
| Vanilla Continual | 78.3 % | 48.9 % | Moderate anisotropy |
| Isotropy‑Regularized | 76.5 % (↓ 1.8 pp vs. vanilla) | 46.2 % (↓ 2.7 pp vs. vanilla) | Near‑uniform (forced by the regularizer) |
- Accuracy drops when isotropy is enforced, even though the regularizer successfully flattens the eigenvalue distribution.
- Anisotropic features naturally emerge as the model adapts to new tasks; this anisotropy correlates with better retention of earlier knowledge.
- Contrastive loss alone (without isotropy) improves representation quality, but the added isotropy term negates those gains.
Takeaway: Making the feature space “spherical” harms the delicate balance between learning new information (plasticity) and preserving old knowledge (stability) in a streaming environment.
Practical Implications
- Avoid copying centralized tricks – Techniques such as batch‑norm whitening, uniformity losses, or explicit isotropy regularizers that are common in static training can be counter‑productive for on‑device or edge continual‑learning systems.
- Design anisotropy‑aware architectures – When building models for lifelong learning (e.g., robotics, personalized assistants, autonomous vehicles), prefer regularizers that preserve or adapt the natural anisotropy rather than suppress it.
- Monitor feature geometry – Simple diagnostics such as eigenvalue spread or cosine‑similarity histograms can be added to training pipelines to flag when a model becomes overly isotropic, providing an early warning for potential forgetting; a minimal runtime check is sketched after this list.
- Consider resource constraints – Isotropy regularization adds extra computation (covariance estimation, additional loss terms) without clear benefit; omitting it can save memory and FLOPs on embedded devices.
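As referenced above, a lightweight per‑batch check might look like the following. The `isotropy_alarm` helper and its `min_spread` threshold are hypothetical and would need calibration against a healthy training run.

```python
import torch

@torch.no_grad()
def isotropy_alarm(z: torch.Tensor, min_spread: float = 5.0) -> bool:
    """Return True when the batch feature space looks suspiciously isotropic.

    min_spread is a placeholder threshold, not a value from the paper.
    """
    zc = z - z.mean(dim=0, keepdim=True)
    eigvals = torch.linalg.eigvalsh(zc.T @ zc / (zc.shape[0] - 1))
    spread = (eigvals[-1] / eigvals[0].clamp(min=1e-12)).item()
    return spread < min_spread  # eigenvalues nearly uniform: possible forgetting risk
```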
Takeaway: The work encourages the community to adopt geometry‑conscious continual‑learning designs, treating the shape of the representation space as a first‑class hyper‑parameter.
Limitations & Future Work
Limitations
- Dataset scope – Experiments are confined to CIFAR‑10/100; larger‑scale or domain‑specific streams (e.g., video, language) may exhibit different dynamics.
- Single backbone – Only ResNet‑18 is evaluated; other architectures (Vision Transformers, recurrent nets) could respond differently to isotropy constraints.
- Regularizer formulation – The study employs a straightforward isotropy penalty; more sophisticated approaches (e.g., task‑aware covariance shaping) might mitigate the observed degradation.
- Theoretical grounding – Empirical evidence is strong, yet a formal analysis linking anisotropy to the plasticity‑stability trade‑off remains an open research direction.
Future Work
- Develop adaptive regularizers that learn the optimal degree of isotropy for each task.
- Investigate how anisotropic feature spaces interact with replay‑based and parameter‑isolation continual‑learning strategies.
- Extend experiments to diverse datasets and model families to validate the generality of the findings.
Authors
- Eduard Angelats
- Paolo Dini
- Chiara Lanza
- Marco Miozzo
- Roberto Pereira
Paper Information
| Item | Details |
|---|---|
| arXiv ID | 2602.06586v1 |
| Categories | cs.LG, cs.DC |
| Published | February 6, 2026 |