[Paper] Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach
Source: arXiv - 2512.09910v1
Overview
Continual learning for Neural Machine Translation (NMT) traditionally suffers from two pain points: catastrophic forgetting (the model loses performance on earlier tasks when it learns new ones) and the computational expense of full‑model retraining. The paper Efficient Continual Learning in Neural Machine Translation: A Low‑Rank Adaptation Approach proposes a lean, plug‑and‑play solution based on Low‑Rank Adaptation (LoRA) that keeps the model’s footprint tiny while still matching the quality of full‑parameter fine‑tuning. It also introduces a gradient‑aware regularizer that protects past knowledge, and a “gate‑free mixture of experts” that lets users blend domain/style adapters on the fly.
Key Contributions
- LoRA‑based fine‑tuning for NMT – Demonstrates that adapting only low‑rank matrices yields translation quality comparable to full‑parameter updates while using < 5 % of the trainable parameters.
- Interactive, linear combination of LoRA modules – Proposes a calibrated mixture‑of‑experts style mechanism that lets developers or end‑users blend multiple domain/style adapters in real time, without any gating network or extra retraining.
- Gradient‑weighted regularization for low‑rank updates – Introduces a novel regularizer that penalizes changes to LoRA matrices based on historic gradient magnitudes, effectively mitigating catastrophic forgetting.
- Extensive empirical validation – Experiments across new language pairs, domain shifts (e.g., medical, legal, conversational), and continual learning scenarios show the approach scales to dozens of tasks with negligible memory overhead.
- Open‑source implementation – The authors release code and pretrained LoRA adapters, making it straightforward to plug into popular Transformer‑based NMT frameworks (e.g., Fairseq, OpenNMT, Hugging Face Transformers).
Methodology
1. Low‑Rank Decomposition (LoRA)
- Instead of updating every weight matrix W in the Transformer, the authors factorize the update as ΔW = A·B, where A ∈ ℝ^{d×r} and B ∈ ℝ^{r×d} with a small rank r (typically 4–16).
- During training only A and B are learned; the original W stays frozen, keeping inference speed unchanged.
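Below is a minimal PyTorch sketch of this idea: a frozen nn.Linear wrapped with a trainable rank-r update ΔW = A·B. The class name, scaling factor, and initialization follow common LoRA practice and are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update ΔW = A·B (sketch)."""

    def __init__(self, base: nn.Linear, rank: int = 8, scaling: float = 2.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # the original W (and bias) stay frozen
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A ∈ R^{d_out×r}, B ∈ R^{r×d_in}; only these two matrices are trained.
        # A starts at zero so ΔW = 0 at the beginning of adaptation (common LoRA init).
        self.A = nn.Parameter(torch.zeros(d_out, rank))
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.scaling = scaling                # assumed scaling factor, not from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x (A B)^T, i.e. the low-rank update runs in parallel
        return self.base(x) + self.scaling * (x @ self.B.t() @ self.A.t())
```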
2. Adapter Library & Linear Mixing
- For each new language or domain a separate LoRA adapter (its own A, B) is trained.
- At inference, a weighted sum of adapters is computed:
\[ \Delta W_{\text{mix}} = \sum_{k} \alpha_k \, A_k B_k \]
where the coefficients α_k are user‑controlled or automatically calibrated (e.g., via a small validation set). No gating network is needed, so the mixture is gate‑free and instantly adjustable.
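A minimal sketch of this gate-free mixing, assuming each adapter is stored as its (A_k, B_k) pair; the function and adapter names are illustrative and not taken from the released code.

```python
import torch

def mixed_delta_w(adapters, alphas):
    """Gate-free blend of LoRA adapters: ΔW_mix = Σ_k α_k (A_k B_k).

    `adapters` is a list of (A_k, B_k) tensor pairs with shapes
    (d_out, r_k) and (r_k, d_in); `alphas` are user-chosen or
    calibrated mixing weights.
    """
    delta = None
    for (A_k, B_k), a_k in zip(adapters, alphas):
        term = a_k * (A_k @ B_k)              # each term has shape (d_out, d_in)
        delta = term if delta is None else delta + term
    return delta

# Example (adapter names are hypothetical): blend a "formal" and a
# "colloquial" style adapter 70/30 on top of the frozen backbone weight:
#   W_effective = W_frozen + mixed_delta_w([(A_formal, B_formal),
#                                           (A_colloq, B_colloq)],
#                                          alphas=[0.7, 0.3])
```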
3. Gradient‑Weighted Regularization
- To protect previously learned tasks, the loss includes a term:
\[ \mathcal{L}_{\text{reg}} = \sum_{k} \lambda_k \, \big\| G^{\text{hist}}_k \odot (A_k B_k) \big\|_F^2 \]
where G^{hist}_k stores the magnitude of gradients observed when the adapter k was originally trained. Large historic gradients → stronger penalty, discouraging drastic changes to important low‑rank directions.
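The penalty as written above could be computed along the following lines; the helper name and the way the historic gradient maps G^hist_k are stored are assumptions made for illustration, not details from the paper.

```python
import torch

def grad_weighted_reg(adapters, grad_hist, lambdas):
    """Σ_k λ_k ‖ G_k^hist ⊙ (A_k B_k) ‖_F², following the formula above.

    `grad_hist[k]` is assumed to hold stored gradient magnitudes with the
    same shape as ΔW_k = A_k B_k, recorded when adapter k was first trained;
    large historic gradients make further drift in those directions costly.
    """
    reg = torch.zeros(())
    for (A_k, B_k), G_k, lam_k in zip(adapters, grad_hist, lambdas):
        delta_k = A_k @ B_k                          # current low-rank update
        reg = reg + lam_k * (G_k * delta_k).pow(2).sum()
    return reg

# During continual training the total objective would then look like:
#   loss = translation_loss + grad_weighted_reg(adapters, grad_hist, lambdas)
```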
4. Training Pipeline
- Start from a strong multilingual NMT base (e.g., mBART or a Transformer‑big).
- For each new task: train a LoRA adapter for a few epochs (often < 2 % of the original training steps).
- Optionally fine‑tune the mixing coefficients α on a small validation set for the target domain/style.
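Putting the steps together, here is a hedged sketch of the per-task adaptation loop: only the current adapter's A/B matrices go into the optimizer, the backbone stays frozen, and the penalty from step 3 can be added on top. The function signature and hyper-parameters are illustrative, not the authors' API.

```python
import torch

def train_adapter(model, adapter_params, train_batches, loss_fn,
                  reg_fn=None, lr=1e-3):
    """Adapt one new task by training only its LoRA parameters (sketch).

    `adapter_params`: the A/B matrices of the current adapter.
    `loss_fn(model, batch)`: returns the translation loss for one batch.
    `reg_fn()`: optional gradient-weighted penalty over earlier adapters
                (see the sketch in step 3).
    """
    opt = torch.optim.AdamW(adapter_params, lr=lr)   # backbone params are excluded
    for batch in train_batches:                      # typically only a few epochs
        opt.zero_grad()
        loss = loss_fn(model, batch)
        if reg_fn is not None:
            loss = loss + reg_fn()
        loss.backward()
        opt.step()
```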
Results & Findings
| Scenario | Baseline (full‑fine‑tune) | LoRA‑only | LoRA + Reg. | BLEU Δ vs. Full |
|---|---|---|---|---|
| New language (Spanish→German) | 31.2 | 30.9 | 31.0 | –0.2 |
| Domain shift (news → medical) | 28.5 | 28.2 | 28.4 | –0.1 |
| Continual 10‑task sequence | 27.8 (final) | 27.1 | 27.7 | –0.1 |
| Parameter overhead | 100 % | 3.8 % | 4.1 % | — |
| Inference latency | 1× | 1× (no extra ops) | 1× | — |
- Performance parity: LoRA adapters achieve within 0.2 BLEU of full‑parameter fine‑tuning across all tested languages and domains.
- Memory efficiency: Adding a new adapter costs only a few megabytes, enabling on‑device or edge deployment of dozens of domain experts.
- Catastrophic forgetting mitigation: The gradient‑weighted regularizer reduces BLEU drop on earlier tasks from ~1.5 (plain LoRA) to < 0.2 after learning 10 new tasks.
- Real‑time style control: Users can blend “formal” vs. “colloquial” adapters with a simple slider, instantly shifting translation style without any latency penalty.
Practical Implications
- Rapid onboarding of new languages/domains – Companies can roll out a new market language by training a tiny LoRA adapter (hours on a single GPU) instead of re‑training the whole NMT system (days/weeks).
- Edge and mobile translation – Because the base model stays frozen and adapters are tiny, devices can store a single multilingual backbone and download only the needed adapters on demand.
- Interactive translation services – SaaS platforms can expose UI controls (e.g., “medical tone”, “legal formality”) that adjust α values in real time, offering personalized output without extra server‑side inference passes.
- Continuous improvement pipelines – Data teams can push incremental updates (new domain data, user feedback) as separate adapters, safely stacking them without risking regression on existing customers.
- Cost savings – Lower GPU memory usage and fewer training epochs translate into reduced cloud compute bills, especially for large multilingual models with hundreds of language pairs.
Limitations & Future Work
- Rank selection sensitivity – Choosing the low‑rank dimension r still requires empirical tuning; too low harms quality, too high erodes the parameter‑efficiency advantage.
- Adapter explosion – While each adapter is small, managing dozens or hundreds of them may become cumbersome; the paper suggests future work on adapter pruning or hierarchical composition.
- Regularizer hyper‑parameters – The gradient‑weighted penalty coefficient λ needs a validation sweep; automating this could improve usability.
- Evaluation on truly low‑resource languages – Experiments focus on medium‑resource pairs; extending to languages with < 10k parallel sentences will test the limits of LoRA’s data efficiency.
- Broader architectural compatibility – The study concentrates on standard Transformer NMT; adapting the approach to newer architectures (e.g., Retrieval‑augmented models or LLM‑based translators) remains an open avenue.
Authors
- Salvador Carrión
- Francisco Casacuberta
Paper Information
- arXiv ID: 2512.09910v1
- Categories: cs.CL, cs.AI
- Published: December 10, 2025