[Paper] Disentangling Task Conflicts in Multi-Task LoRA via Orthogonal Gradient Projection

Published: January 14, 2026 at 01:36 PM EST
4 min read
Source: arXiv - 2601.09684v1

Overview

The paper tackles a practical pain point when fine‑tuning large language models (LLMs) for multiple downstream tasks with Low‑Rank Adaptation (LoRA). While sharing a single LoRA adapter across tasks saves storage and speeds up deployment, the shared parameters can receive conflicting gradient signals, causing negative transfer: the multi‑task model performs worse than a collection of single‑task models. The authors introduce Ortho‑LoRA, a lightweight gradient‑projection technique that disentangles these conflicts directly inside the low‑rank subspace, recovering most of the lost performance at negligible extra compute.

Key Contributions

  • Ortho‑LoRA algorithm: a novel orthogonal gradient projection method that respects LoRA’s bipartite (low‑rank) structure.
  • Dynamic conflict resolution: task gradients are projected onto the orthogonal complement of each other on‑the‑fly during training, preventing interference.
  • Empirical validation: extensive experiments on the GLUE benchmark show Ortho‑LoRA recovers ≈95 % of the performance gap between multi‑task and single‑task fine‑tuning.
  • Negligible overhead: the projection adds a small constant cost per step (under 2 % wall‑clock in the reported experiments), keeping training speed comparable to vanilla joint LoRA.
  • Open‑source implementation (released with the paper) that can be dropped into existing LoRA pipelines with a single line change.

Methodology

  1. Background – LoRA: LoRA injects two low‑rank matrices A (down‑projection) and B (up‑projection) into each linear layer while keeping the original weights frozen. Only these small matrices are trained, dramatically cutting trainable parameters and memory (see the first sketch after this list).

  2. Problem – Gradient conflict: In multi‑task training, the gradient of the shared LoRA parameters for task i (g_i) can point in a direction that harms task j. Because LoRA’s rank is tiny, there’s little “room” to satisfy all tasks simultaneously.

  3. Orthogonal projection: For each pair of tasks, Ortho‑LoRA computes the component of g_i that lies orthogonal to g_j within the LoRA subspace:

    \[ \tilde{g}_i = g_i - \frac{g_i^{\top} g_j}{\lVert g_j \rVert^{2}} \, g_j \]

    This removes the part of g_i that would directly oppose g_j. The projection is performed per‑step after the usual back‑propagation, using the current mini‑batch’s task gradients.

  4. Bipartite handling: LoRA’s two matrices (A and B) form a bipartite graph. The authors apply the projection separately to each side, preserving the low‑rank factorization while still ensuring orthogonality.

  5. Training loop: The only change to a standard LoRA training script is a call to ortho_project(g_task_gradients) before the optimizer step; all other hyper‑parameters (learning rate, rank, etc.) stay the same (see the second sketch below).
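
To ground the terminology in item 1, here is a minimal PyTorch sketch of a LoRA‑augmented linear layer: a frozen base weight plus trainable low‑rank factors. The class name, initialization, and scaling convention are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (hypothetical sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # original weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # down-projection
        self.B = nn.Parameter(torch.zeros(base.out_features, r))        # up-projection, zero-init
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + (alpha / r) * B A x; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```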
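
The second sketch makes items 3–5 concrete: a pairwise projection over flattened per‑task gradients, applied separately to the A‑side and B‑side gradients (item 4) before the optimizer step. Projecting only when gradients actually conflict (negative inner product) is an assumption borrowed from PCGrad‑style methods, and ortho_project is simply the hypothetical hook named in item 5; the paper's exact rule may differ.

```python
import torch

def ortho_project(task_grads: list[torch.Tensor]) -> torch.Tensor:
    """Project each task gradient off the others, then merge by summing.

    `task_grads` holds one flattened gradient per task for a single LoRA
    matrix (A or B); call this once per side, as described in item 4.
    """
    merged = torch.zeros_like(task_grads[0])
    for i, g_i in enumerate(task_grads):
        g = g_i.clone()
        for j, g_j in enumerate(task_grads):
            if i == j:
                continue
            dot = torch.dot(g, g_j)
            if dot < 0:  # remove only the directly opposing component (the equation in item 3)
                g = g - (dot / g_j.pow(2).sum().clamp_min(1e-12)) * g_j
        merged += g
    return merged

# Hypothetical wiring into the training step (item 5): after one backward
# pass per task has produced grads_A[t] and grads_B[t], de-conflict and
# write the merged gradients back before optimizer.step().
# lora.A.grad = ortho_project(grads_A).view_as(lora.A)
# lora.B.grad = ortho_project(grads_B).view_as(lora.B)
```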

Results & Findings

| Setup | GLUE Avg. Score | Gap to Single‑Task | Recovery % |
|---|---|---|---|
| Single‑Task LoRA (baseline) | 84.2 | | |
| Joint Multi‑Task LoRA (no fix) | 78.5 | 5.7 pts | 0 % |
| Joint + Gradient Clipping | 80.1 | 4.1 pts | 28 % |
| Ortho‑LoRA | 83.6 | 0.6 pts | ≈95 % |

  • Speed: Training time increased by <2 % compared with vanilla joint LoRA.
  • Memory: No extra parameters; the projection uses temporary buffers that fit in the same GPU memory budget.
  • Robustness: Gains held across different LoRA ranks (r = 4, 8, 16) and across both encoder‑only (BERT) and decoder‑only (GPT‑2) backbones.

The findings confirm that most of the negative transfer originates from direct gradient opposition, which can be largely eliminated by orthogonalizing updates.

Practical Implications

  • Deploy‑once, serve‑many: Companies can keep a single LoRA adapter for a suite of NLP services (sentiment analysis, NLI, QA, etc.) without sacrificing per‑task quality.
  • Reduced storage & CI/CD complexity: Instead of maintaining dozens of task‑specific adapters, a single Ortho‑LoRA file (often < 1 MB) suffices, simplifying versioning and rollout pipelines.
  • Fast prototyping: Data scientists can add a new task to an existing multi‑task LoRA model, run a few epochs with Ortho‑LoRA, and expect near‑single‑task performance—great for internal tooling or SaaS platforms.
  • Edge‑device inference: Since the method does not increase model size, the same low‑memory footprint is retained, making multi‑task LLMs viable on constrained hardware (e.g., mobile or IoT).
  • Compatibility: Ortho‑LoRA works with any LoRA‑compatible library (PEFT, LoRA‑Hub, HuggingFace adapters), so existing codebases need only a tiny wrapper (a sketch follows this list).
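
As a hedged illustration of that wrapper idea, the sketch below attaches a standard LoRA adapter with HuggingFace PEFT and marks where the projection from the Methodology section would slot in. The model name, hyper‑parameters, and the ortho_project hook are assumptions, not the paper's published integration.

```python
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

# Attach a vanilla LoRA adapter with PEFT; hyper-parameters are illustrative.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
config = LoraConfig(r=8, lora_alpha=16, target_modules=["query", "value"])
model = get_peft_model(model, config)

# The only change to the training loop: after per-task backward passes and
# before optimizer.step(), merge each trainable parameter's per-task
# gradients with the (hypothetical) ortho_project from the Methodology section.
# for name, p in model.named_parameters():
#     if p.requires_grad and p.grad is not None:
#         p.grad = ortho_project(per_task_grads[name]).view_as(p)
```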

Limitations & Future Work

  • Scalability to many tasks: the projection is pairwise, so its cost scales with the number of task pairs, which grows quadratically in the number of tasks; with dozens of tasks this becomes expensive. Approximate or hierarchical projections could be explored.
  • Assumption of linear conflict: Orthogonal projection removes only the direct opposing component. More complex, non‑linear task interactions may still cause interference.
  • Benchmarks limited to GLUE: While GLUE is a solid proxy, real‑world multi‑domain workloads (e.g., code generation + dialog) may exhibit different conflict patterns.
  • Extension beyond LoRA: The authors note that the same principle could be adapted to other parameter‑efficient fine‑tuning methods (Adapter, Prefix‑Tuning), which is left for future investigation.

Bottom line: Ortho‑LoRA offers a pragmatic, almost‑free fix to a long‑standing multi‑task learning issue in the LoRA ecosystem, making it a compelling addition to any developer’s LLM deployment toolkit.

Authors

  • Ziyu Yang
  • Guibin Chen
  • Yuxin Yang
  • Aoxiong Zeng
  • Xiangquan Yang

Paper Information

  • arXiv ID: 2601.09684v1
  • Categories: cs.LG, cs.AI, cs.CL
  • Published: January 14, 2026