[Paper] Shared LoRA Subspaces for almost Strict Continual Learning
Source: arXiv - 2602.06043v1
Overview
The paper introduces Share, a new way to fine‑tune massive pretrained models (like CLIP, Stable Diffusion, or large language models) for a stream of tasks without blowing up memory or forgetting what was learned before. By keeping a single, shared low‑rank subspace that evolves as new tasks arrive, Share delivers the parameter efficiency of LoRA while adding true continual‑learning capabilities—no data replay, no proliferation of adapters, and near‑perfect performance retention.
Key Contributions
- Shared Low‑Rank Subspace: A single, dynamically updated LoRA‑style subspace that stores knowledge from all previously seen tasks.
- Almost‑Strict Continual Learning: Virtually eliminates catastrophic forgetting without replay buffers or a growing collection of full task‑specific adapters.
- Massive Resource Savings: Up to 100× fewer trainable parameters and 281× less memory compared with naïve per‑task LoRA.
- Cross‑Modal Generality: Demonstrated on image classification, NLP, 3D pose estimation, and text‑to‑image generation.
- Scalable Deployment: One Share model can replace hundreds of individual LoRA adapters, enabling asynchronous, on‑device updates.
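To put the savings in perspective, here is a back‑of‑the‑envelope count for a single weight matrix. The dimensions below (d, r) are illustrative choices, not figures from the paper:

```python
# Illustrative parameter counts for one d x d weight matrix (hypothetical dims).
d, r = 4096, 16

full_ft = d * d          # full fine-tuning: update every weight
lora = 2 * d * r         # classic LoRA: A (d x r) and B (d x r)
share_task = r * r       # Share-style per-task cost: one r x r coefficient block

print(full_ft)             # 16777216
print(lora)                # 131072
print(lora / share_task)   # 512.0 -- per-task cost vs. a fresh LoRA adapter
```

The headline ~100× figure depends on how many layers are adapted and how the shared basis is amortised across tasks; this sketch only shows why per‑task cost collapses once the basis is shared.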
Methodology
- Base Model & LoRA Primer: Start from a frozen large pretrained network (e.g., a vision transformer). LoRA injects trainable low‑rank matrices ΔW = A·Bᵀ into selected layers, keeping the original weights untouched.
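The LoRA update itself fits in a few lines. A minimal NumPy sketch with hypothetical dimensions (initialising one factor to zero, so ΔW starts at zero, follows the usual LoRA convention):

```python
import numpy as np

d_out, d_in, r = 512, 512, 8               # hypothetical layer dims, r << d
rng = np.random.default_rng(0)

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = np.zeros((d_out, r))                   # trainable low-rank factor (starts at 0)
B = rng.standard_normal((d_in, r)) * 0.01  # trainable low-rank factor

def lora_forward(x):
    # y = x (W + A B^T)^T : base path plus low-rank correction; W stays untouched
    return x @ (W + A @ B.T).T

x = rng.standard_normal((4, d_in))
y = lora_forward(x)
assert y.shape == (4, d_out)
assert np.allclose(y, x @ W.T)             # A = 0, so the update is inert at init
```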
- Constructing the Shared Subspace:
  - Initialize a global low‑rank basis U ∈ ℝ^{d×r} (r ≪ d).
  - For each incoming task t, compute a task‑specific projection ΔWₜ = U·Cₜ, where Cₜ is a small task‑specific coefficient matrix (r×r).
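Note that U·Cₜ with an r×r coefficient block yields a d×r matrix, so some mapping back to the weight shape is implied. One dimensionally consistent reading for square layers is the two‑sided form ΔWₜ = U·Cₜ·Uᵀ; the sketch below adopts that as an assumption, not as the paper's exact parameterisation:

```python
import numpy as np

d, r = 256, 8
rng = np.random.default_rng(0)

# Shared orthonormal basis U (d x r) and a tiny per-task coefficient block C_t.
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
C_t = rng.standard_normal((r, r)) * 0.01

# Assumed two-sided projection so that dW_t has the full d x d weight shape.
dW_t = U @ C_t @ U.T
assert dW_t.shape == (d, d)

# The update lives entirely in span(U): projecting back recovers C_t.
assert np.allclose(U.T @ dW_t @ U, C_t)
```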
- Dynamic Subspace Update:
  - After training on task t, evaluate the gradient directions that contributed most to performance gain.
  - Expand or rotate U to absorb these directions using a subspace‑expansion step (e.g., QR decomposition + low‑rank truncation).
  - Old tasks keep using the updated U, so their knowledge is automatically merged into the shared representation.
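A minimal sketch of the QR‑plus‑truncation idea follows. How the paper scores and selects gradient directions is not reproduced here; the inputs below are illustrative stand‑ins:

```python
import numpy as np

def expand_subspace(U, G, r_max):
    """Absorb new directions G (d x k) into the orthonormal basis U (d x r),
    then truncate back to at most r_max columns."""
    resid = G - U @ (U.T @ G)            # components of G outside span(U)
    Q, R = np.linalg.qr(resid)
    keep = np.abs(np.diag(R)) > 1e-10    # drop numerically negligible directions
    U_big = np.hstack([U, Q[:, keep]])   # expanded orthonormal basis
    return U_big[:, :r_max]              # low-rank truncation to the budget

rng = np.random.default_rng(1)
d, r = 64, 4
U, _ = np.linalg.qr(rng.standard_normal((d, r)))
G = rng.standard_normal((d, 2))          # stand-in for top gradient directions
U_new = expand_subspace(U, G, r_max=6)

assert U_new.shape == (64, 6)
assert np.allclose(U_new.T @ U_new, np.eye(6), atol=1e-8)
```

Because the residual is orthogonalised against U before being appended, the expanded basis stays orthonormal, so existing coefficient blocks remain interpretable in the updated subspace.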
- Training Loop:
  - Freeze the backbone, train only Cₜ for the current task while U is kept fixed.
  - Periodically run the subspace‑update routine to integrate the newly learned directions.
- Inference:
  - At test time, the model simply uses the latest U (no per‑task adapters needed).
The whole pipeline requires only a handful of extra matrices per task (the Cₜ coefficients) and a single global basis that grows modestly over time.
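Putting the pieces together, a toy end‑to‑end loop. Everything here is illustrative: random "target updates" stand in for task training, a closed‑form projection stands in for gradient descent on Cₜ, and the two‑sided ΔW = U·C·Uᵀ parameterisation is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r_max = 32, 8
U, _ = np.linalg.qr(rng.standard_normal((d, 4)))   # initial shared basis
coeffs = {}                                        # tiny per-task C_t blocks

for t in range(3):                                 # stream of tasks
    target = rng.standard_normal((d, d))           # stand-in for a task's ideal dW
    # "Train" C_t with U frozen: under the assumed dW = U C U^T form,
    # the Frobenius-optimal C_t is the projection of the target onto U.
    C_t = U.T @ target @ U
    coeffs[t] = C_t
    # Subspace update: absorb one leading residual direction, within budget.
    resid = target - U @ C_t @ U.T
    g = resid @ rng.standard_normal(d)             # cheap proxy for a top direction
    g -= U @ (U.T @ g)                             # orthogonalise against U
    if np.linalg.norm(g) > 1e-10 and U.shape[1] < r_max:
        U = np.hstack([U, (g / np.linalg.norm(g))[:, None]])

assert U.shape == (32, 7)                          # basis grew from 4 to 7 columns
assert np.allclose(U.T @ U, np.eye(U.shape[1]), atol=1e-8)
```

The stored state after three tasks is exactly what the text describes: one shared basis U plus a dictionary of small coefficient blocks.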
Results & Findings
| Domain | Baseline (Joint) | Per‑Task LoRA | Share (ours) | Parameter Reduction | Memory Reduction |
|---|---|---|---|---|---|
| Image Classification (ImageNet‑100) | 78.3 % | 77.9 % | 77.6 % | ~100× | ~281× |
| NLP (GLUE benchmark) | 84.1 % | 83.8 % | 83.5 % | ~95× | ~260× |
| 3D Pose Estimation | 92.0 % | 91.7 % | 91.5 % | ~90× | ~250× |
| Text‑to‑Image (Stable Diffusion) | FID 12.4 | FID 12.7 | FID 12.9 | ~110× | ~300× |
- Performance Gap: Share stays within about 1 % of jointly trained models across all four domains, far better than naïve sequential fine‑tuning, which suffers a >5 % drop after only a few tasks.
- Forward Transfer: Later tasks often start from a better initialization because the shared subspace already encodes useful features from earlier domains.
- Ablation: Removing the subspace‑expansion step leads to rapid forgetting, confirming its role in preserving past knowledge.
Practical Implications
- Deploy‑once, update‑anywhere: Companies can ship a single large model to edge devices and later push tiny Cₜ updates (a few KB) for new features without re‑flashing the whole model.
- Cost‑Effective MLOps: Training budgets shrink dramatically—only the low‑rank coefficients need back‑propagation, cutting GPU hours and storage.
- Multi‑tenant SaaS platforms: A service provider can host one Share model that serves thousands of customers, each with its own task profile, eliminating the need to maintain a zoo of adapters.
- Regulatory & Privacy‑friendly: Since Share does not rely on replay buffers, it respects data‑privacy constraints while still learning from sequentially arriving proprietary datasets.
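The "few KB" claim above is easy to sanity‑check with hypothetical numbers; the rank, adapted‑layer count, and precision below are assumptions, not the paper's configuration:

```python
# Size of a full per-task update shipped to a device (hypothetical config).
r = 16                 # subspace rank
n_layers = 24          # number of adapted layers
bytes_per_weight = 2   # fp16

update_bytes = r * r * n_layers * bytes_per_weight
print(update_bytes / 1024)   # 12.0 -- a ~12 KB over-the-air update
```

Compare with pushing a fresh LoRA adapter (2·d·r parameters per layer) or a full checkpoint: both are orders of magnitude larger for realistic d.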
Limitations & Future Work
- Subspace Growth Control: Although the basis is kept low‑rank, continual expansion can eventually hit a ceiling; smarter pruning or budgeted subspace allocation is needed for very long task streams.
- Task Similarity Assumption: Share works best when tasks share underlying representations; highly divergent tasks may require multiple subspaces or hierarchical sharing.
- Theoretical Guarantees: The paper provides empirical evidence but lacks formal bounds on forgetting or subspace optimality—future work could bridge this gap.
- Real‑time Adaptation: Current updates are batch‑oriented; extending the method to truly online, per‑sample updates would broaden its applicability to streaming scenarios.
Authors
- Prakhar Kaushik
- Ankit Vaidya
- Shravan Chaudhari
- Rama Chellappa
- Alan Yuille
Paper Information
- arXiv ID: 2602.06043v1
- Categories: cs.LG, cs.AI, cs.CV
- Published: February 5, 2026