[Paper] Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach
Source: arXiv - 2512.09910v1
Overview
Continual learning for Neural Machine Translation (NMT) traditionally suffers from two pain points: catastrophic forgetting (the model loses performance on earlier tasks when it learns new ones) and the computational expense of full‑model retraining. The paper Efficient Continual Learning in Neural Machine Translation: A Low‑Rank Adaptation Approach proposes a lean, plug‑and‑play solution based on Low‑Rank Adaptation (LoRA) that keeps the model’s footprint tiny while still matching the quality of full‑parameter fine‑tuning. It also introduces a gradient‑aware regularizer that protects past knowledge, and a “gate‑free mixture of experts” that lets users blend domain/style adapters on the fly.
Key Contributions
- LoRA‑based fine‑tuning for NMT – Demonstrates that adapting only low‑rank matrices yields translation quality comparable to full‑parameter updates while using < 5 % of the trainable parameters.
- Interactive, linear combination of LoRA modules – Proposes a calibrated mixture‑of‑experts style mechanism that lets developers or end‑users blend multiple domain/style adapters in real time, without any gating network or extra retraining.
- Gradient‑weighted regularization for low‑rank updates – Introduces a novel regularizer that penalizes changes to LoRA matrices based on historic gradient magnitudes, effectively mitigating catastrophic forgetting.
- Extensive empirical validation – Experiments across new language pairs, domain shifts (e.g., medical, legal, conversational), and continual learning scenarios show the approach scales to dozens of tasks with negligible memory overhead.
- Open‑source implementation – The authors release code and pretrained LoRA adapters, making it straightforward to plug into popular Transformer‑based NMT frameworks (e.g., Fairseq, OpenNMT, Hugging Face Transformers).
Methodology
1. Low‑Rank Decomposition (LoRA)
- Instead of updating every weight matrix W in the Transformer, the authors factorize the update as ΔW = A·B, where A ∈ ℝ^{d×r} and B ∈ ℝ^{r×d} with a small rank r (typically 4–16).
- During training only A and B are learned; the original W stays frozen, keeping inference speed unchanged.
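Below is a minimal PyTorch sketch of this idea: a frozen nn.Linear wrapped with a trainable rank-r update ΔW = A·B. The class name, scaling factor, and initialization follow common LoRA practice and are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update ΔW = A·B (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, scaling: float = 2.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the original W (and bias) stay frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A ∈ R^{d_out×r}, B ∈ R^{r×d_in}; only these two matrices are trained.
        # A starts at zero so ΔW = 0 at the beginning of adaptation (common LoRA init).
        self.A = nn.Parameter(torch.zeros(d_out, rank))
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.scaling = scaling                # assumed scaling factor, not from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x (A B)^T, i.e. the low-rank update runs in parallel
        return self.base(x) + self.scaling * (x @ self.B.t() @ self.A.t())
```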
2. Adapter Library & Linear Mixing
- For each new language or domain a separate LoRA adapter (its own A, B) is trained.
- At inference, a weighted sum of adapters is computed:
\[ \Delta W_{\text{mix}} = \sum_{k} \alpha_k \, A_k B_k \]
where the coefficients α_k are user‑controlled or automatically calibrated (e.g., via a small validation set). No gating network is needed, so the mixture is gate‑free and instantly adjustable.
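A minimal sketch of this gate-free mixing, assuming each adapter is stored as its (A_k, B_k) pair; the function and adapter names are illustrative and not taken from the released code.

```python
import torch

def mixed_delta_w(adapters, alphas):
    """Gate-free blend of LoRA adapters: ΔW_mix = Σ_k α_k (A_k B_k).

    `adapters` is a list of (A_k, B_k) tensor pairs with shapes
    (d_out, r_k) and (r_k, d_in); `alphas` are user-chosen or
    calibrated mixing weights.
    """
    delta = None
    for (A_k, B_k), a_k in zip(adapters, alphas):
        term = a_k * (A_k @ B_k)              # each term has shape (d_out, d_in)
        delta = term if delta is None else delta + term
    return delta

# Example (adapter names are hypothetical): blend a "formal" and a
# "colloquial" style adapter 70/30 on top of the frozen backbone weight:
#   W_effective = W_frozen + mixed_delta_w([(A_formal, B_formal),
#                                           (A_colloq, B_colloq)],
#                                          alphas=[0.7, 0.3])
```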
3. Gradient‑Weighted Regularization
- To protect previously learned tasks, the loss includes a term:
\[ \mathcal{L}_{\text{reg}} = \sum_{k} \lambda_k \, \big\| G^{\text{hist}}_k \odot (A_k B_k) \big\|_F^2 \]
where G^{hist}_k stores the magnitude of gradients observed when the adapter k was originally trained. Large historic gradients → stronger penalty, discouraging drastic changes to important low‑rank directions.
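The penalty as written above could be computed along the following lines; the helper name and the way the historic gradient maps G^hist_k are stored are assumptions made for illustration, not details from the paper.

```python
import torch

def grad_weighted_reg(adapters, grad_hist, lambdas):
    """Σ_k λ_k ‖ G_k^hist ⊙ (A_k B_k) ‖_F², following the formula above.

    `grad_hist[k]` is assumed to hold stored gradient magnitudes with the
    same shape as ΔW_k = A_k B_k, recorded when adapter k was first trained;
    large historic gradients make further drift in those directions costly.
    """
    reg = torch.zeros(())
    for (A_k, B_k), G_k, lam_k in zip(adapters, grad_hist, lambdas):
        delta_k = A_k @ B_k                          # current low-rank update
        reg = reg + lam_k * (G_k * delta_k).pow(2).sum()
    return reg

# During continual training the total objective would then look like:
#   loss = translation_loss + grad_weighted_reg(adapters, grad_hist, lambdas)
```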
4. Training Pipeline
- Start from a strong multilingual NMT base (e.g., mBART or a Transformer‑big).
- For each new task: train a LoRA adapter for a few epochs (often < 2 % of the original training steps).
- Optionally fine‑tune the mixing coefficients α on a small validation set for the target domain/style.
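Putting the steps together, here is a hedged sketch of the per-task adaptation loop: only the current adapter's A/B matrices go into the optimizer, the backbone stays frozen, and the penalty from step 3 can be added on top. The function signature and hyper-parameters are illustrative, not the authors' API.

```python
import torch

def train_adapter(model, adapter_params, train_batches, loss_fn,
                  reg_fn=None, lr=1e-3):
    """Adapt one new task by training only its LoRA parameters (sketch).

    `adapter_params`: the A/B matrices of the current adapter.
    `loss_fn(model, batch)`: returns the translation loss for one batch.
    `reg_fn()`: optional gradient-weighted penalty over earlier adapters
                (see the sketch in step 3).
    """
    opt = torch.optim.AdamW(adapter_params, lr=lr)   # backbone params are excluded
    for batch in train_batches:                      # typically only a few epochs
        opt.zero_grad()
        loss = loss_fn(model, batch)
        if reg_fn is not None:
            loss = loss + reg_fn()
        loss.backward()
        opt.step()
```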
Results & Findings
| Scenario | Baseline (full‑fine‑tune) | LoRA‑only | LoRA + Reg. | BLEU Δ vs. Full |
|---|---|---|---|---|
| New language (Spanish→German) | 31.2 | 30.9 | 31.0 | –0.2 |
| Domain shift (news → medical) | 28.5 | 28.2 | 28.4 | –0.1 |
| Continual 10‑task sequence | 27.8 (final) | 27.1 | 27.7 | –0.1 |
| Parameter overhead | 100 % | 3.8 % | 4.1 % | — |
| Inference latency | 1× | 1× (no extra ops) | 1× | — |
- Performance parity: LoRA adapters achieve within 0.2 BLEU of full‑parameter fine‑tuning across all tested languages and domains.
- Memory efficiency: Adding a new adapter costs only a few megabytes, enabling on‑device or edge deployment of dozens of domain experts.
- Catastrophic forgetting mitigation: The gradient‑weighted regularizer reduces BLEU drop on earlier tasks from ~1.5 (plain LoRA) to < 0.2 after learning 10 new tasks.
- Real‑time style control: Users can blend “formal” vs. “colloquial” adapters with a simple slider, instantly shifting translation style without any latency penalty.
Practical Implications
- Rapid onboarding of new languages/domains – Companies can roll out a new market language by training a tiny LoRA adapter (hours on a single GPU) instead of re‑training the whole NMT system (days/weeks).
- Edge and mobile translation – Because the base model stays frozen and adapters are tiny, devices can store a single multilingual backbone and download only the needed adapters on demand.
- Interactive translation services – SaaS platforms can expose UI controls (e.g., “medical tone”, “legal formality”) that adjust α values in real time, offering personalized output without extra server‑side inference passes.
- Continuous improvement pipelines – Data teams can push incremental updates (new domain data, user feedback) as separate adapters, safely stacking them without risking regression on existing customers.
- Cost savings – Lower GPU memory usage and fewer training epochs translate into reduced cloud compute bills, especially for large multilingual models with hundreds of language pairs.
Limitations & Future Work
- Rank selection sensitivity – Choosing the low‑rank dimension r still requires empirical tuning; too low harms quality, too high erodes the parameter‑efficiency advantage.
- Adapter explosion – While each adapter is small, managing dozens or hundreds of them may become cumbersome; the paper suggests future work on adapter pruning or hierarchical composition.
- Regularizer hyper‑parameters – The gradient‑weighted penalty coefficient λ needs a validation sweep; automating this could improve usability.
- Evaluation on truly low‑resource languages – Experiments focus on medium‑resource pairs; extending to languages with < 10k parallel sentences will test the limits of LoRA’s data efficiency.
- Broader architectural compatibility – The study concentrates on standard Transformer NMT; adapting the approach to newer architectures (e.g., Retrieval‑augmented models or LLM‑based translators) remains an open avenue.
Authors
- Salvador Carrión
- Francisco Casacuberta
Paper Information
- arXiv ID: 2512.09910v1
- Categories: cs.CL, cs.AI
- Published: December 10, 2025