[Paper] JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models
Source: arXiv - 2604.16171v1
Overview
The paper “JumpLoRA: Sparse Adapters for Continual Learning in Large Language Models” introduces a lightweight way to make adapters—tiny trainable modules added to frozen LLMs—both sparser and more isolated across tasks. By inserting a simple gating function (JumpReLU) into LoRA blocks, the authors achieve dynamic sparsity that curbs catastrophic forgetting while keeping the computational budget low. The result is a plug‑and‑play upgrade that lifts the performance of existing LoRA‑based continual‑learning pipelines and even beats the current state‑of‑the‑art method ELLA.
Key Contributions
- JumpReLU gating: A novel, train‑time gating mechanism that selectively deactivates rows/columns in LoRA matrices, inducing task‑specific sparsity on the fly.
- Dynamic parameter isolation: The gating creates “islands” of active parameters per task, reducing interference without needing explicit subspace constraints.
- Modular compatibility: JumpLoRA can be stacked on top of any LoRA‑based continual‑learning method (e.g., IncLoRA) with minimal code changes.
- Empirical gains: Across several benchmark CL streams (e.g., GLUE‑CL, continual QA), JumpLoRA + IncLoRA outperforms ELLA by up to 3.2 % absolute accuracy (see the results table below) while using ≤ 30 % extra FLOPs.
- Open‑source implementation: The authors release a PyTorch library that integrates seamlessly with Hugging Face's `peft` adapters.
Methodology
- Base architecture: Start with a frozen pre‑trained LLM (e.g., LLaMA‑7B) and attach standard LoRA adapters (low‑rank matrices A and B) to the attention and feed‑forward layers.
- JumpReLU gate: For each LoRA matrix, a parallel gate vector g (with one entry per low‑rank dimension) is learned. The forward pass multiplies the LoRA output by JumpReLU(g), a piecewise‑linear function that outputs either 0 (gate closed) or a scaled positive value (gate open).
- Task‑specific sparsity: When a new task arrives, the gate parameters are re‑initialized and trained only on that task's data. Because the gate is binary‑like, many rows/columns stay at zero, effectively “turning off” parts of the adapter that belong to previous tasks.
- Training regime: The authors adopt a two‑stage schedule—first fine‑tune the gate with a small learning rate while freezing the LoRA weights, then jointly update both. A lightweight L1 regularizer on the gate encourages sparsity.
- Integration with CL strategies: Existing CL methods (e.g., IncLoRA) already maintain a separate LoRA per task. JumpLoRA simply adds the gate on top, so the same rehearsal or regularization tricks can be reused.
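The gated forward pass described above can be sketched in a few lines. This is a minimal NumPy illustration, not the authors' released implementation; the threshold `theta`, scaling `alpha`, and regularization weight `lam` are assumed hyperparameters.

```python
import numpy as np

def jump_relu(g, theta=0.1):
    # JumpReLU: exactly 0 below the threshold, the raw value above it
    # (a discontinuous "jump", unlike a standard ReLU hinged at 0).
    return np.where(g > theta, g, 0.0)

def jumplora_forward(x, W0, A, B, g, theta=0.1, alpha=1.0):
    """Frozen base projection plus a gated low-rank update.
    Shapes: x (d_in,), W0 (d_out, d_in), A (r, d_in), B (d_out, r), g (r,)."""
    base = W0 @ x                            # frozen pre-trained weight
    low_rank = A @ x                         # project into the rank-r subspace
    gated = jump_relu(g, theta) * low_rank   # closed gates zero whole rank directions
    return base + alpha * (B @ gated)

def gate_l1(g, lam=1e-3):
    # Lightweight L1 penalty on the gate, pushing entries toward zero
    # (the sparsity regularizer from the two-stage training schedule).
    return lam * np.abs(g).sum()
```

When every gate entry falls below `theta`, the layer reduces exactly to the frozen base model, which is what isolates previously learned tasks from a newly trained adapter.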
Results & Findings
| Dataset (Continual Setting) | Baseline (LoRA) | IncLoRA | ELLA (SOTA) | JumpLoRA + IncLoRA |
|---|---|---|---|---|
| GLUE‑CL (5 tasks) | 71.4 % | 74.1 % | 75.6 % | 78.8 % (+3.2 % over ELLA) |
| Continual QA (3 domains) | 62.7 % | 65.9 % | 66.5 % | 69.3 % (+2.8 % over ELLA) |
| Sentiment‑Stream (10 tasks) | 68.2 % | 70.5 % | 71.0 % | 73.1 % (+2.1 % over ELLA) |
- Parameter efficiency: On average, ≈ 45 % of each task's LoRA weights were gated off, cutting adapter memory usage by ~0.5 GB for a 7B model.
- Training speed: The extra gating adds < 5 % overhead per epoch, negligible compared to full fine‑tuning.
- Ablation: Removing the L1 regularizer or using a standard ReLU gate drops performance by ~1.5 %, confirming the importance of the JumpReLU design.
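As a rough sanity check on the memory figure, one can count adapter parameters for LLaMA‑7B‑like shapes. The rank and dtype below are illustrative assumptions (the summary does not state them), so the result is only order‑of‑magnitude.

```python
# Back-of-envelope adapter memory under assumed settings: rank and dtype
# are illustrative assumptions, not values reported in the summary.
hidden, ffn, layers, rank = 4096, 11008, 32, 64   # LLaMA-7B-like shapes
bytes_per_param = 4                               # fp32

attn = 4 * rank * (hidden + hidden)               # q, k, v, o projections
ffn_mats = 3 * rank * (hidden + ffn)              # gate, up, down projections
total_params = layers * (attn + ffn_mats)

full_mb = total_params * bytes_per_param / 2**20  # ≈ 610 MB of adapters
saved_mb = 0.45 * full_mb                         # ≈ 45 % average sparsity
```

Under these assumptions the full adapters occupy roughly 0.6 GB, so gating off ~45 % of them recovers a few hundred MB; the exact saving depends on the true rank, precision, and whether optimizer state is counted.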
Practical Implications
- Plug‑and‑play adapters: Developers can retrofit existing LoRA‑based pipelines (e.g., for domain‑specific chatbots) with JumpLoRA to get better task separation without re‑architecting the whole system.
- Edge deployment: Because the gating yields sparse adapters, the final model footprint fits more comfortably into GPU‑memory‑constrained environments (e.g., inference on a single RTX 3080).
- Rapid product iteration: Companies that need to roll out new language‑understanding features (sentiment analysis, intent detection) on top of a frozen LLM can now add a new “task adapter” in minutes, with reduced risk of degrading previously shipped capabilities.
- Continual fine‑tuning services: Cloud providers offering “LLM as a service” can expose a “sparse‑adapter” endpoint, letting customers upload task data and receive a lightweight, isolated adapter bundle that can be swapped in at runtime.
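The “swap in at runtime” idea can be sketched as a small per‑task gate store. The class and method names here are hypothetical illustrations, not an API from the paper's release.

```python
import numpy as np

class TaskGateRegistry:
    """Hypothetical per-task gate store: one gate vector per task,
    swapped in at inference time so adapters stay isolated."""

    def __init__(self, rank):
        self.rank = rank
        self.gates = {}

    def add_task(self, name, init=0.5):
        # Fresh gate per task, re-initialized as in the two-stage schedule.
        self.gates[name] = np.full(self.rank, init)

    def active_gate(self, name, theta=0.1):
        # JumpReLU applied at inference: closed entries contribute nothing.
        g = self.gates[name]
        return np.where(g > theta, g, 0.0)

    def sparsity(self, name, theta=0.1):
        # Fraction of gate entries that are closed (zeroed out).
        return float(np.mean(self.gates[name] <= theta))
```

A serving layer could keep one registry per deployed LLM and look up the caller's task name per request; because only the gate (and its task's LoRA matrices) changes, swapping tasks never touches the frozen base weights.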
Limitations & Future Work
- Task similarity handling: JumpLoRA treats each task independently; when tasks are highly related, the hard isolation may forgo beneficial knowledge transfer.
- Scalability to hundreds of tasks: Although memory stays low per adapter, the cumulative number of gates grows linearly, which could become a management overhead.
- Evaluation breadth: Experiments focus on classification and QA; applying the method to generation‑heavy tasks (e.g., code synthesis) remains an open question.
- Future directions: The authors suggest exploring soft gating schedules that allow controlled sharing, and integrating the approach with parameter‑efficient prompting (e.g., prefix‑tuning) for even tighter resource budgets.
Authors
- Alexandra Dragomir
- Ioana Pintilie
- Antonio Barbalau
- Marius Dragoi
- Florin Brad
- Cristian Daniel Paduraru
- Alexandru Tifrea
- Elena Burceanu
- Radu Tudor Ionescu
Paper Information
- arXiv ID: 2604.16171v1
- Categories: cs.LG, cs.AI, cs.CL
- Published: April 17, 2026