[Paper] Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach

Published: December 10, 2025 at 01:37 PM EST

Source: arXiv - 2512.09910v1

Overview

Continual learning for Neural Machine Translation (NMT) traditionally suffers from two pain points: catastrophic forgetting (the model loses performance on earlier tasks when it learns new ones) and the computational expense of full‑model retraining. The paper Efficient Continual Learning in Neural Machine Translation: A Low‑Rank Adaptation Approach proposes a lean, plug‑and‑play solution based on Low‑Rank Adaptation (LoRA) that keeps the model’s footprint tiny while still matching the quality of full‑parameter fine‑tuning. It also introduces a gradient‑aware regularizer that protects past knowledge, and a “gate‑free mixture of experts” that lets users blend domain/style adapters on the fly.

Key Contributions

  • LoRA‑based fine‑tuning for NMT – Demonstrates that adapting only low‑rank matrices yields translation quality comparable to full‑parameter updates while using < 5 % of the trainable parameters.
  • Interactive, linear combination of LoRA modules – Proposes a calibrated mixture‑of‑experts style mechanism that lets developers or end‑users blend multiple domain/style adapters in real time, without any gating network or extra retraining.
  • Gradient‑weighted regularization for low‑rank updates – Introduces a novel regularizer that penalizes changes to LoRA matrices based on historic gradient magnitudes, effectively mitigating catastrophic forgetting.
  • Extensive empirical validation – Experiments across new language pairs, domain shifts (e.g., medical, legal, conversational), and continual learning scenarios show the approach scales to dozens of tasks with negligible memory overhead.
  • Open‑source implementation – The authors release code and pretrained LoRA adapters, making it straightforward to plug into popular Transformer‑based NMT frameworks (e.g., Fairseq, OpenNMT, Hugging Face Transformers).

Methodology

1. Low‑Rank Decomposition (LoRA)

  • Instead of updating every weight matrix W in the Transformer, the authors factorize the update as ΔW = A·B, where A ∈ ℝ^{d×r} and B ∈ ℝ^{r×d} with a small rank r (typically 4–16).
  • During training only A and B are learned; the original W stays frozen, keeping inference speed unchanged.
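
The snippet below is a minimal PyTorch sketch of this idea, not the authors' implementation: a frozen `nn.Linear` plus a trainable low‑rank update ΔW = A·B. The rank r, initialization, and scaling factor are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update ΔW = A·B (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # original W (and bias) stay frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)   # A ∈ R^{d×r}
        self.B = nn.Parameter(torch.zeros(r, d_in))           # B ∈ R^{r×d}, zero-init so ΔW = 0 at start
        self.scale = scale

    def forward(self, x):
        delta_w = self.A @ self.B                 # low-rank update; base W is never modified
        return self.base(x) + self.scale * (x @ delta_w.T)

# Wrap a single projection of a Transformer layer (r = 8 is an example value).
layer = LoRALinear(nn.Linear(512, 512), r=8)
out = layer(torch.randn(2, 10, 512))              # only A and B receive gradients
```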

2. Adapter Library & Linear Mixing

  • For each new language or domain a separate LoRA adapter (its own A, B) is trained.
  • At inference, a weighted sum of adapters is computed:

\[ \Delta W_{\text{mix}} = \sum_{k} \alpha_k \, (A_k B_k) \]

where the coefficients α_k are user‑controlled or automatically calibrated (e.g., via a small validation set). No gating network is needed, so the mixture is gate‑free and instantly adjustable.
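
As a rough illustration, the sketch below builds ΔW_mix as a plain weighted sum of adapter products. The adapter names ("medical", "formal") and shapes are hypothetical and not taken from the paper's released code.

```python
import torch

def mix_adapters(adapters, alphas):
    """ΔW_mix = Σ_k α_k (A_k B_k): a plain weighted sum, no gating network."""
    delta_w = None
    for (A_k, B_k), a_k in zip(adapters, alphas):
        term = a_k * (A_k @ B_k)
        delta_w = term if delta_w is None else delta_w + term
    return delta_w

# Example: blend a hypothetical "medical" and "formal-style" adapter 70/30.
d, r = 512, 8
medical = (torch.randn(d, r), torch.randn(r, d))
formal = (torch.randn(d, r), torch.randn(r, d))
delta_w_mix = mix_adapters([medical, formal], alphas=[0.7, 0.3])
# delta_w_mix is then added on top of the frozen base weight W at inference time.
```

Because the coefficients enter only through this linear sum, changing a slider value simply recomputes the weighted sum; no retraining or gating pass is involved.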

3. Gradient‑Weighted Regularization

  • To protect previously learned tasks, the loss includes a term:

\[ \mathcal{L}_{\text{reg}} = \sum_{k} \lambda_k \, \| G^{\text{hist}}_k \odot (A_k B_k) \|_F^2 \]

where G^{hist}_k stores the magnitude of the gradients observed when adapter k was originally trained. The larger the historic gradients, the stronger the penalty, which discourages drastic changes to the low‑rank directions that mattered most for earlier tasks.
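
A literal reading of this loss can be sketched as follows. How exactly G^{hist}_k is accumulated (e.g., a running mean of absolute gradients) and how λ_k is chosen are assumptions here, not details confirmed by the paper.

```python
import torch

def gradient_weighted_reg(adapters, grad_hists, lambdas):
    """L_reg = Σ_k λ_k || G_k^hist ⊙ (A_k B_k) ||_F^2 (illustrative sketch)."""
    loss = torch.zeros(())
    for (A_k, B_k), G_k, lam_k in zip(adapters, grad_hists, lambdas):
        weighted = G_k * (A_k @ B_k)                 # element-wise ⊙ with historic gradient magnitudes
        loss = loss + lam_k * weighted.pow(2).sum()  # squared Frobenius norm
    return loss

# Usage (shapes only): G_k has the same d×d shape as A_k B_k.
d, r = 512, 8
adapters = [(torch.randn(d, r, requires_grad=True), torch.randn(r, d, requires_grad=True))]
grad_hists = [torch.rand(d, d)]                      # recorded |gradients| from the adapter's original training
reg = gradient_weighted_reg(adapters, grad_hists, lambdas=[0.1])
```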

4. Training Pipeline

  1. Start from a strong multilingual NMT base (e.g., mBART or a Transformer‑big).
  2. For each new task: train a LoRA adapter for a few epochs (often < 2 % of the original training steps).
  3. Optionally fine‑tune the mixing coefficients α on a small validation set for the target domain/style.
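
Put together, the adapter-training step can be sketched roughly as below. The function and loader names are placeholders, and the Hugging Face–style `.loss` interface is an assumption, not the authors' codebase.

```python
import torch

def train_adapter(model, task_loader, epochs=2, lr=1e-4):
    """Step 2: update only the LoRA parameters (A, B) for a new task."""
    lora_params = [p for p in model.parameters() if p.requires_grad]  # base weights were frozen earlier
    opt = torch.optim.AdamW(lora_params, lr=lr)
    for _ in range(epochs):
        for batch in task_loader:
            loss = model(**batch).loss            # assumes an HF-style seq2seq model returning .loss
            loss.backward()
            opt.step()
            opt.zero_grad()

# Step 3 (optional): tune the mixing coefficients alpha on a small in-domain validation
# set, e.g., by a grid search over a handful of candidate weightings.
```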

Results & Findings

| Scenario | Baseline (full fine‑tune) | LoRA‑only | LoRA + Reg. | BLEU Δ vs. full |
| --- | --- | --- | --- | --- |
| New language (Spanish→German) | 31.2 | 30.9 | 31.0 | –0.2 |
| Domain shift (news → medical) | 28.5 | 28.2 | 28.4 | –0.1 |
| Continual 10‑task sequence | 27.8 (final) | 27.1 | 27.7 | –0.1 |
| Parameter overhead | 100 % | 3.8 % | 4.1 % | — |
| Inference latency | 1× | 1× (no extra ops) | 1× (no extra ops) | — |

  • Performance parity: LoRA adapters achieve within 0.2 BLEU of full‑parameter fine‑tuning across all tested languages and domains.
  • Memory efficiency: Adding a new adapter costs only a few megabytes, enabling on‑device or edge deployment of dozens of domain experts.
  • Catastrophic forgetting mitigation: The gradient‑weighted regularizer reduces BLEU drop on earlier tasks from ~1.5 (plain LoRA) to < 0.2 after learning 10 new tasks.
  • Real‑time style control: Users can blend “formal” vs. “colloquial” adapters with a simple slider, instantly shifting translation style without any latency penalty.

Practical Implications

  • Rapid onboarding of new languages/domains – Companies can roll out a new market language by training a tiny LoRA adapter (hours on a single GPU) instead of re‑training the whole NMT system (days/weeks).
  • Edge and mobile translation – Because the base model stays frozen and adapters are tiny, devices can store a single multilingual backbone and download only the needed adapters on demand.
  • Interactive translation services – SaaS platforms can expose UI controls (e.g., “medical tone”, “legal formality”) that adjust α values in real time, offering personalized output without extra server‑side inference passes.
  • Continuous improvement pipelines – Data teams can push incremental updates (new domain data, user feedback) as separate adapters, safely stacking them without risking regression on existing customers.
  • Cost savings – Lower GPU memory usage and fewer training epochs translate into reduced cloud compute bills, especially for large multilingual models with hundreds of language pairs.

Limitations & Future Work

  • Rank selection sensitivity – Choosing the low‑rank dimension r still requires empirical tuning; too low harms quality, too high erodes the parameter‑efficiency advantage.
  • Adapter explosion – While each adapter is small, managing dozens or hundreds of them may become cumbersome; the paper suggests future work on adapter pruning or hierarchical composition.
  • Regularizer hyper‑parameters – The gradient‑weighted penalty coefficient λ needs a validation sweep; automating this could improve usability.
  • Evaluation on truly low‑resource languages – Experiments focus on medium‑resource pairs; extending to languages with < 10k parallel sentences will test the limits of LoRA’s data efficiency.
  • Broader architectural compatibility – The study concentrates on standard Transformer NMT; adapting the approach to newer architectures (e.g., Retrieval‑augmented models or LLM‑based translators) remains an open avenue.

Authors

  • Salvador Carrión
  • Francisco Casacuberta

Paper Information

  • arXiv ID: 2512.09910v1
  • Categories: cs.CL, cs.AI
  • Published: December 10, 2025