[Paper] Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection

Published: January 14, 2026 at 01:36 PM EST
4 min read
Source: arXiv - 2601.09684v1

Overview

The paper tackles a practical pain point when fine‑tuning large language models (LLMs) for multiple downstream tasks with Low‑Rank Adaptation (LoRA). While sharing a single LoRA adapter across tasks saves storage and speeds up deployment, the shared parameters can receive conflicting gradient signals, causing negative transfer: the multi‑task model performs worse than a collection of single‑task models. The authors introduce Ortho‑LoRA, a lightweight gradient‑projection technique that disentangles these conflicts directly inside the low‑rank subspace, recovering most of the lost performance at negligible extra compute.

Key Contributions

  • Ortho‑LoRA algorithm: a novel orthogonal gradient projection method that respects LoRA’s bipartite (low‑rank) structure.
  • Dynamic conflict resolution: task gradients are projected onto the orthogonal complement of each other on‑the‑fly during training, preventing interference.
  • Empirical validation: extensive experiments on the GLUE benchmark show Ortho‑LoRA recovers ≈95 % of the performance gap between multi‑task and single‑task fine‑tuning.
  • Negligible overhead: the projection adds a small constant cost per step (under 2 % wall‑clock in the reported experiments), keeping training speed comparable to vanilla joint LoRA.
  • Open‑source implementation (released with the paper) that can be dropped into existing LoRA pipelines with a single line change.

Methodology

  1. Background – LoRA: LoRA injects two low‑rank matrices A (down‑projection) and B (up‑projection) into each linear layer while keeping the original weights frozen. Only these small matrices are trained, dramatically cutting trainable parameters and memory (see the first sketch after this list).

  2. Problem – Gradient conflict: In multi‑task training, the gradient of the shared LoRA parameters for task i (g_i) can point in a direction that harms task j. Because LoRA’s rank is tiny, there’s little “room” to satisfy all tasks simultaneously.

  3. Orthogonal projection: For each pair of tasks, Ortho‑LoRA computes the component of g_i that lies orthogonal to g_j within the LoRA subspace:

    \[ \tilde{g}_i = g_i - \frac{g_i^{\top} g_j}{\lVert g_j \rVert^{2}} \, g_j \]

    This removes the part of g_i that would directly oppose g_j. The projection is performed per‑step after the usual back‑propagation, using the current mini‑batch’s task gradients.

  4. Bipartite handling: LoRA’s two matrices (A and B) form a bipartite graph. The authors apply the projection separately to each side, preserving the low‑rank factorization while still ensuring orthogonality.

  5. Training loop: The only change to a standard LoRA training script is a call to ortho_project(g_task_gradients) before the optimizer step; all other hyper‑parameters (learning rate, rank, etc.) stay the same (see the second sketch below).
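
To ground the terminology in item 1, here is a minimal PyTorch sketch of a LoRA‑augmented linear layer: a frozen base weight plus trainable low‑rank factors. The class name, initialization, and scaling convention are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```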
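
The second sketch makes items 3–5 concrete: a pairwise projection over flattened per‑task gradients, applied separately to the A‑side and B‑side gradients (item 4) before the optimizer step. Projecting only when gradients actually conflict (negative inner product) is an assumption borrowed from PCGrad‑style methods, and ortho_project is simply the hypothetical hook named in item 5; the paper's exact rule may differ.

```python
import torch

def ortho_project(task_grads: list[torch.Tensor]) -> torch.Tensor:
    """Project each task gradient off the others, then merge by summing.

    `task_grads` holds one flattened gradient per task for a single LoRA
    matrix (A or B); call this once per side, as described in item 4.
    """
    merged = torch.zeros_like(task_grads[0])
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        for j, g_j in enumerate(task_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            if dot < 0:  # remove only the directly opposing component (the equation in item 3)
                g = g - (dot / g_j.pow(2).sum().clamp_min(1e-12)) * g_j
        merged += g
    return merged

# Hypothetical wiring into the training step (item 5): after one backward
# pass per task has produced grads_A[t] and grads_B[t], de-conflict and
# write the merged gradients back before optimizer.step().
# lora.A.grad = ortho_project(grads_A).view_as(lora.A)
# lora.B.grad = ortho_project(grads_B).view_as(lora.B)
```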

Results & Findings

| Setup | GLUE Avg. Score | Gap to Single‑Task | Recovery % |
|---|---|---|---|
| Single‑Task LoRA (baseline) | 84.2 | | |
| Joint Multi‑Task LoRA (no fix) | 78.5 | 5.7 pts | 0 % |
| Joint + Gradient Clipping | 80.1 | 4.1 pts | 28 % |
| Ortho‑LoRA | 83.6 | 0.6 pts | ≈95 % |

  • Speed: Training time increased by <2 % compared with vanilla joint LoRA.
  • Memory: No extra parameters; the projection uses temporary buffers that fit in the same GPU memory budget.
  • Robustness: Gains held across different LoRA ranks (r = 4, 8, 16) and across both encoder‑only (BERT) and decoder‑only (GPT‑2) backbones.

The findings confirm that most of the negative transfer originates from direct gradient opposition, which can be largely eliminated by orthogonalizing updates.

Practical Implications

  • Deploy‑once, serve‑many: Companies can keep a single LoRA adapter for a suite of NLP services (sentiment analysis, NLI, QA, etc.) without sacrificing per‑task quality.
  • Reduced storage & CI/CD complexity: Instead of maintaining dozens of task‑specific adapters, a single Ortho‑LoRA file (often < 1 MB) suffices, simplifying versioning and rollout pipelines.
  • Fast prototyping: Data scientists can add a new task to an existing multi‑task LoRA model, run a few epochs with Ortho‑LoRA, and expect near‑single‑task performance—great for internal tooling or SaaS platforms.
  • Edge‑device inference: Since the method does not increase model size, the same low‑memory footprint is retained, making multi‑task LLMs viable on constrained hardware (e.g., mobile or IoT).
  • Compatibility: Ortho‑LoRA works with any LoRA‑compatible library (PEFT, LoRA‑Hub, HuggingFace adapters), so existing codebases need only a tiny wrapper (a sketch follows this list).
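
As a hedged illustration of that wrapper idea, the sketch below attaches a standard LoRA adapter with HuggingFace PEFT and marks where the projection from the Methodology section would slot in. The model name, hyper‑parameters, and the ortho_project hook are assumptions, not the paper's published integration.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Attach a vanilla LoRA adapter with PEFT; hyper-parameters are illustrative.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
model = get_peft_model(model, config)

# The only change to the training loop: after per-task backward passes and
# before optimizer.step(), merge each trainable parameter's per-task
# gradients with the (hypothetical) ortho_project from the Methodology section.
# for name, p in model.named_parameters():
#     if p.requires_grad and p.grad is not None:
#         p.grad = ortho_project(per_task_grads[name]).view_as(p)
```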

Limitations & Future Work

  • Scalability to many tasks: the projection is pairwise, so its cost scales with the number of task pairs, which grows quadratically in the number of tasks; with dozens of tasks this becomes expensive. Approximate or hierarchical projections could be explored.
  • Assumption of linear conflict: Orthogonal projection removes only the direct opposing component. More complex, non‑linear task interactions may still cause interference.
  • Benchmarks limited to GLUE: While GLUE is a solid proxy, real‑world multi‑domain workloads (e.g., code generation + dialog) may exhibit different conflict patterns.
  • Extension beyond LoRA: The authors note that the same principle could be adapted to other parameter‑efficient fine‑tuning methods (Adapter, Prefix‑Tuning), which is left for future investigation.

Bottom line: Ortho‑LoRA offers a pragmatic, almost‑free fix to a long‑standing multi‑task learning issue in the LoRA ecosystem, making it a compelling addition to any developer’s LLM deployment toolkit.

Authors

  • Ziyu Yang
  • Guibin Chen
  • Yuxin Yang
  • Aoxiong Zeng
  • Xiangquan Yang

Paper Information

  • arXiv ID: 2601.09684v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: January 14, 2026