[Paper] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models

Published: April 17, 2026 at 11:38 AM EDT
4 min read
Source: arXiv - 2604.16171v1

Overview

The paper “JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models” introduces a lightweight way to make adapters—tiny trainable modules added to frozen LLMs—both sparser and more isolated across tasks. By inserting a simple gating function (JumpReLU) into LoRA blocks, the authors achieve dynamic sparsity that curbs catastrophic forgetting while keeping the computational budget low. The result is a plug‑and‑play upgrade that lifts the performance of existing LoRA‑based continual‑learning pipelines and even beats the current state‑of‑the‑art method ELLA.

Key Contributions

  • JumpReLU gating: A novel, train‑time gating mechanism that selectively deactivates rows/columns in LoRA matrices, inducing task‑specific sparsity on the fly.
  • Dynamic parameter isolation: The gating creates “islands” of active parameters per task, reducing interference without needing explicit subspace constraints.
  • Modular compatibility: JumpLoRA can be stacked on top of any LoRA‑based continual‑learning method (e.g., IncLoRA) with minimal code changes.
  • Empirical gains: Across several benchmark CL streams (e.g., GLUE‑CL, continual QA), JumpLoRA + IncLoRA outperforms ELLA by up to 4.2 % absolute accuracy while using ≤ 30 % extra FLOPs.
  • Open‑source implementation: The authors release a PyTorch library that integrates seamlessly with Hugging Face’s peft adapters.
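The gating idea in the first bullet can be sketched in a few lines. This is a minimal illustration of a hard-threshold JumpReLU, not the authors' released code; the function name, threshold value, and use of PyTorch are assumptions for illustration:

```python
import torch

def jump_relu(g: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Hard-threshold gate: entries at or below the threshold are zeroed
    (gate closed); entries above it pass through unchanged (gate open)."""
    return torch.where(g > threshold, g, torch.zeros_like(g))

g = torch.tensor([0.1, 0.7, -0.3, 1.2])
print(jump_relu(g))  # tensor([0.0000, 0.7000, 0.0000, 1.2000])
```

Unlike a standard ReLU, the output jumps discontinuously from 0 to the threshold value, which is what makes the gate behave in the binary-like, on/off fashion the paper relies on for isolation.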

Methodology

  1. Base architecture: Start with a frozen pre‑trained LLM (e.g., LLaMA‑7B) and attach standard LoRA adapters (low‑rank matrices A and B) to the attention and feed‑forward layers.
  2. JumpReLU gate: For each LoRA matrix, a learned gate vector g (one entry per rank component of the low‑rank decomposition) is trained in parallel. The forward pass multiplies the LoRA output by JumpReLU(g), a piecewise‑linear function that outputs either 0 (gate closed) or a scaled positive value (gate open).
  3. Task‑specific sparsity: When a new task arrives, the gate parameters are re‑initialized and trained only on that task’s data. Because the gate is binary‑like, many rows/columns stay at zero, effectively “turning off” parts of the adapter that belong to previous tasks.
  4. Training regime: The authors adopt a two‑stage schedule—first fine‑tune the gate with a small learning rate while freezing the LoRA weights, then jointly update both. A lightweight L1 regularizer on the gate encourages sparsity.
  5. Integration with CL strategies: Existing CL methods (e.g., IncLoRA) already maintain a separate LoRA per task. JumpLoRA simply adds the gate on top, so the same rehearsal or regularization tricks can be reused.
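The steps above can be sketched as a single PyTorch module. Everything here — the class names, the straight-through gradient trick in the gate, and the rank and scaling choices — is one plausible reading of the description, not the paper's actual implementation:

```python
import torch
import torch.nn as nn


class JumpReLU(nn.Module):
    """Hard gate: forward passes g through where g > threshold, else 0.
    A straight-through estimator lets gradients flow despite the jump
    (one common choice; the paper's exact formulation may differ)."""

    def __init__(self, threshold: float = 0.5):
        super().__init__()
        self.threshold = threshold

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        mask = (g > self.threshold).float()
        # Forward value is g * mask; the extra term equals 0 in the forward
        # pass but routes identity gradients to sub-threshold entries.
        return g * mask + (g - g.detach()) * (1.0 - mask)


class JumpLoRALinear(nn.Module):
    """Frozen linear layer plus a gated low-rank update:
    y = W x + (alpha / r) * B diag(JumpReLU(g)) A x.
    Closing a gate entry switches off an entire rank component."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                 # step 1: base stays frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.gate = nn.Parameter(torch.ones(rank))  # one gate per rank component
        self.jump = JumpReLU()
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.jump(self.gate)                    # sparse, mostly-binary vector
        delta = ((x @ self.A.t()) * g) @ self.B.t()
        return self.base(x) + self.scale * delta


layer = JumpLoRALinear(nn.Linear(32, 16))
out = layer(torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 16])
```

During training, adding a term like `lambda_l1 * layer.gate.abs().sum()` to the loss would play the role of the L1 sparsity regularizer from step 4, pushing gate entries below the threshold and shutting off their rank components.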

Results & Findings

| Dataset (Continual Setting) | Baseline (LoRA) | IncLoRA | ELLA (SOTA) | JumpLoRA + IncLoRA |
| --- | --- | --- | --- | --- |
| GLUE‑CL (5 tasks) | 71.4 % | 74.1 % | 75.6 % | 78.8 % (+3.2 % over ELLA) |
| Continual QA (3 domains) | 62.7 % | 65.9 % | 66.5 % | 69.3 % (+2.8 % over ELLA) |
| Sentiment‑Stream (10 tasks) | 68.2 % | 70.5 % | 71.0 % | 73.1 % (+2.1 % over ELLA) |

  • Parameter efficiency: Average sparsity per task reached ≈ 45 % of the LoRA weights, cutting memory usage by ~0.5 GB for a 7B model.
  • Training speed: The extra gating adds < 5 % overhead per epoch, negligible compared to full fine‑tuning.
  • Ablation: Removing the L1 regularizer or using a standard ReLU gate drops performance by ~1.5 %, confirming the importance of the JumpReLU design.

Practical Implications

  • Plug‑and‑play adapters: Developers can retrofit existing LoRA‑based pipelines (e.g., for domain‑specific chatbots) with JumpLoRA to get better task separation without re‑architecting the whole system.
  • Edge deployment: Because the gating yields sparse adapters, the final model footprint fits more comfortably in GPU‑memory‑constrained environments (e.g., inference on a single RTX 3080).
  • Rapid product iteration: Companies that need to roll out new language‑understanding features (sentiment analysis, intent detection) on top of a frozen LLM can now add a new “task adapter” in minutes, with reduced risk of degrading previously shipped capabilities.
  • Continual fine‑tuning services: Cloud providers offering “LLM as a service” can expose a “sparse‑adapter” endpoint, letting customers upload task data and receive a lightweight, isolated adapter bundle that can be swapped in at runtime.
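As a concrete picture of the runtime adapter swapping described above, the sketch below keeps one frozen base weight and a registry of per-task gate/low-rank bundles. All task names, shapes, and the bundle format are hypothetical, not the paper's released format:

```python
import torch

torch.manual_seed(0)
base_weight = torch.randn(16, 32)  # shared, frozen pretrained weight (out, in)

# Hypothetical per-task adapter bundles: each task ships only its own gate
# vector and low-rank matrices; real pipelines would load these from disk.
adapters = {
    "sentiment": {"gate": torch.tensor([1.0, 0.0, 1.0, 0.0]),
                  "A": torch.randn(4, 32) * 0.01,
                  "B": torch.randn(16, 4) * 0.01},
    "intent":    {"gate": torch.tensor([0.0, 1.0, 0.0, 1.0]),
                  "A": torch.randn(4, 32) * 0.01,
                  "B": torch.randn(16, 4) * 0.01},
}

def forward_with_adapter(x: torch.Tensor, task: str, scale: float = 2.0):
    """Base projection plus the selected task's gated low-rank update."""
    a = adapters[task]
    delta = ((x @ a["A"].t()) * a["gate"]) @ a["B"].t()
    return x @ base_weight.t() + scale * delta

x = torch.randn(1, 32)
y_sent = forward_with_adapter(x, "sentiment")
y_intent = forward_with_adapter(x, "intent")
print(y_sent.shape, y_intent.shape)
```

Because the gates of the two bundles activate disjoint rank components, swapping a bundle changes only that task's slice of the update, which is the isolation property that makes hot-swapping adapters safe for previously shipped capabilities.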

Limitations & Future Work

  • Task similarity handling: JumpLoRA treats each task independently; when tasks are highly related, the hard isolation may forgo beneficial knowledge transfer.
  • Scalability to hundreds of tasks: Although memory stays low per adapter, the cumulative number of gates grows linearly, which could become a management overhead.
  • Evaluation breadth: Experiments focus on classification and QA; applying the method to generation‑heavy tasks (e.g., code synthesis) remains an open question.
  • Future directions: The authors suggest exploring soft gating schedules that allow controlled sharing, and integrating the approach with parameter‑efficient prompting (e.g., prefix‑tuning) for even tighter resource budgets.

Authors

  • Alexandra Dragomir
  • Ioana Pintilie
  • Antonio Barbalau
  • Marius Dragoi
  • Florin Brad
  • Cristian Daniel Paduraru
  • Alexandru Tifrea
  • Elena Burceanu
  • Radu Tudor Ionescu

Paper Information

  • arXiv ID: 2604.16171v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: April 17, 2026