[Paper] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
Source: arXiv - 2604.16171v1
Overview
The paper “JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models” introduces a lightweight way to make adapters—tiny trainable modules added to frozen LLMs—both sparser and more isolated across tasks. By inserting a simple gating function (JumpReLU) into LoRA blocks, the authors achieve dynamic sparsity that curbs catastrophic forgetting while keeping the computational budget low. The result is a plug‑and‑play upgrade that lifts the performance of existing LoRA‑based continual‑learning pipelines and even beats the current state‑of‑the‑art method ELLA.
Key Contributions
- JumpReLU gating: A novel, train‑time gating mechanism that selectively deactivates rows/columns in LoRA matrices, inducing task‑specific sparsity on the fly.
- Dynamic parameter isolation: The gating creates “islands” of active parameters per task, reducing interference without needing explicit subspace constraints.
- Modular compatibility: JumpLoRA can be stacked on top of any LoRA‑based continual‑learning method (e.g., IncLoRA) with minimal code changes.
- Empirical gains: Across several benchmark CL streams (e.g., GLUE‑CL, continual QA), JumpLoRA + IncLoRA outperforms ELLA by up to 3.2 % absolute accuracy (see the results table below) while using ≤ 30 % extra FLOPs.
- Open‑source implementation: The authors release a PyTorch library that integrates seamlessly with Hugging Face's `peft` adapters.
Methodology
- Base architecture: Start with a frozen pre‑trained LLM (e.g., LLaMA‑7B) and attach standard LoRA adapters (low‑rank matrices A and B) to the attention and feed‑forward layers.
- JumpReLU gate: For each LoRA matrix, a parallel gate vector g (with one entry per low‑rank dimension) is learned. The forward pass multiplies the LoRA output by JumpReLU(g), a piecewise‑linear function that outputs either 0 (gate closed) or a scaled positive value (gate open).
- Task‑specific sparsity: When a new task arrives, the gate parameters are re‑initialized and trained only on that task's data. Because the gate is binary‑like, many rows/columns stay at zero, effectively “turning off” parts of the adapter that belong to previous tasks.
- Training regime: The authors adopt a two‑stage schedule—first fine‑tune the gate with a small learning rate while freezing the LoRA weights, then jointly update both. A lightweight L1 regularizer on the gate encourages sparsity.
- Integration with CL strategies: Existing CL methods (e.g., IncLoRA) already maintain a separate LoRA per task. JumpLoRA simply adds the gate on top, so the same rehearsal or regularization tricks can be reused.
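The gated forward pass described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' released implementation; the threshold `theta`, scaling `alpha`, and regularization weight `lam` are assumed hyperparameters.

```python
import numpy as np

def jump_relu(g, theta=0.1):
    # JumpReLU: exactly 0 below the threshold, the raw value above it
    # (a discontinuous "jump", unlike a standard ReLU hinged at 0).
    return np.where(g > theta, g, 0.0)

def jumplora_forward(x, W0, A, B, g, theta=0.1, alpha=1.0):
    """Frozen base projection plus a gated low-rank update.
    Shapes: x (d_in,), W0 (d_out, d_in), A (r, d_in), B (d_out, r), g (r,)."""
    base = W0 @ x                            # frozen pre-trained weight
    low_rank = A @ x                         # project into the rank-r subspace
    gated = jump_relu(g, theta) * low_rank   # closed gates zero whole rank directions
    return base + alpha * (B @ gated)

def gate_l1(g, lam=1e-3):
    # Lightweight L1 penalty on the gate, pushing entries toward zero
    # (the sparsity regularizer from the two-stage training schedule).
    return lam * np.abs(g).sum()
```

When every gate entry falls below `theta`, the layer reduces exactly to the frozen base model, which is what isolates previously learned tasks from a newly trained adapter.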
Results & Findings
| Dataset (Continual Setting) | Baseline (LoRA) | IncLoRA | ELLA (SOTA) | JumpLoRA + IncLoRA |
|---|---|---|---|---|
| GLUE‑CL (5 tasks) | 71.4 % | 74.1 % | 75.6 % | 78.8 % (+3.2 % over ELLA) |
| Continual QA (3 domains) | 62.7 % | 65.9 % | 66.5 % | 69.3 % (+2.8 % over ELLA) |
| Sentiment‑Stream (10 tasks) | 68.2 % | 70.5 % | 71.0 % | 73.1 % (+2.1 % over ELLA) |
- Parameter efficiency: On average, ≈ 45 % of each task's LoRA weights were gated off, cutting adapter memory usage by ~0.5 GB for a 7B model.
- Training speed: The extra gating adds < 5 % overhead per epoch, negligible compared to full fine‑tuning.
- Ablation: Removing the L1 regularizer or using a standard ReLU gate drops performance by ~1.5 %, confirming the importance of the JumpReLU design.
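As a rough sanity check on the memory figure, one can count adapter parameters for LLaMA‑7B‑like shapes. The rank and dtype below are illustrative assumptions (the summary does not state them), so the result is only order‑of‑magnitude.

```python
# Back-of-envelope adapter memory under assumed settings: rank and dtype
# are illustrative assumptions, not values reported in the summary.
hidden, ffn, layers, rank = 4096, 11008, 32, 64   # LLaMA-7B-like shapes
bytes_per_param = 4                               # fp32

attn = 4 * rank * (hidden + hidden)               # q, k, v, o projections
ffn_mats = 3 * rank * (hidden + ffn)              # gate, up, down projections
total_params = layers * (attn + ffn_mats)

full_mb = total_params * bytes_per_param / 2**20  # ≈ 610 MB of adapters
saved_mb = 0.45 * full_mb                         # ≈ 45 % average sparsity
```

Under these assumptions the full adapters occupy roughly 0.6 GB, so gating off ~45 % of them recovers a few hundred MB; the exact saving depends on the true rank, precision, and whether optimizer state is counted.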
Practical Implications
- Plug‑and‑play adapters: Developers can retrofit existing LoRA‑based pipelines (e.g., for domain‑specific chatbots) with JumpLoRA to get better task separation without re‑architecting the whole system.
- Edge deployment: Because the gating yields sparse adapters, the final model footprint fits more comfortably into GPU‑memory‑constrained environments (e.g., inference on a single RTX 3080).
- Rapid product iteration: Companies that need to roll out new language‑understanding features (sentiment analysis, intent detection) on top of a frozen LLM can now add a new “task adapter” in minutes, with reduced risk of degrading previously shipped capabilities.
- Continual fine‑tuning services: Cloud providers offering “LLM as a service” can expose a “sparse‑adapter” endpoint, letting customers upload task data and receive a lightweight, isolated adapter bundle that can be swapped in at runtime.
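The “swap in at runtime” idea can be sketched as a small per‑task gate store. The class and method names here are hypothetical illustrations, not an API from the paper's release.

```python
import numpy as np

class TaskGateRegistry:
    """Hypothetical per-task gate store: one gate vector per task,
    swapped in at inference time so adapters stay isolated."""

    def __init__(self, rank):
        self.rank = rank
        self.gates = {}

    def add_task(self, name, init=0.5):
        # Fresh gate per task, re-initialized as in the two-stage schedule.
        self.gates[name] = np.full(self.rank, init)

    def active_gate(self, name, theta=0.1):
        # JumpReLU applied at inference: closed entries contribute nothing.
        g = self.gates[name]
        return np.where(g > theta, g, 0.0)

    def sparsity(self, name, theta=0.1):
        # Fraction of gate entries that are closed (zeroed out).
        return float(np.mean(self.gates[name] <= theta))
```

A serving layer could keep one registry per deployed LLM and look up the caller's task name per request; because only the gate (and its task's LoRA matrices) changes, swapping tasks never touches the frozen base weights.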
Limitations & Future Work
- Task similarity handling: JumpLoRA treats each task independently; when tasks are highly related, the hard isolation may forgo beneficial knowledge transfer.
- Scalability to hundreds of tasks: Although memory stays low per adapter, the cumulative number of gates grows linearly, which could become a management overhead.
- Evaluation breadth: Experiments focus on classification and QA; applying the method to generation‑heavy tasks (e.g., code synthesis) remains an open question.
- Future directions: The authors suggest exploring soft gating schedules that allow controlled sharing, and integrating the approach with parameter‑efficient prompting (e.g., prefix‑tuning) for even tighter resource budgets.
Authors
- Alexandra Dragomir
- Ioana Pintilie
- Antonio Barbalau
- Marius Dragoi
- Florin Brad
- Cristian Daniel Paduraru
- Alexandru Tifrea
- Elena Burceanu
- Radu Tudor Ionescu
Paper Information
- arXiv ID: 2604.16171v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: April 17, 2026