[Paper] Align Once, Benefit Multilingually: Enforcing Multilingual Consistency for LLM Safety Alignment

Published: February 18, 2026 at 01:01 PM EST

Source: arXiv - 2602.16660v1

Overview

Large language models (LLMs) are being rolled out to users around the globe, but keeping them “safe” (i.e., preventing harmful or biased outputs) in every language is still a major challenge. The new paper “Align Once, Benefit Multilingually” proposes a lightweight way to make a single alignment step improve safety across many languages at once, without the massive data‑collection effort that current multilingual approaches require.

Key Contributions

  • Multi‑Lingual Consistency (MLC) loss: a plug‑and‑play regularizer that can be added to any existing monolingual alignment pipeline.
  • Single‑update multilingual safety: By encouraging the internal representations of translated prompts to stay collinear, the model learns to behave consistently across languages in one training pass.
  • No extra response‑level supervision: The method works with only multilingual prompt variants, eliminating the need for costly, high‑quality safety labels in low‑resource languages.
  • Broad applicability: Demonstrated on several LLM architectures (decoder‑only, encoder‑decoder) and alignment paradigms (RLHF, supervised fine‑tuning).
  • Empirical validation: Shows measurable safety gains in many languages while preserving overall model performance.

Methodology

  1. Start with a monolingual alignment baseline (e.g., RLHF or supervised fine‑tuning on English safety data).
  2. Generate multilingual prompt variants by translating the same safety‑oriented instruction into many target languages using a high‑quality translation system.
  3. Pass each variant through the LLM and extract the hidden‑state vectors (or pooled embeddings) that correspond to the prompt.
  4. Apply the MLC loss:
    • Compute the cosine similarity between every pair of language‑specific prompt embeddings.
    • Penalize deviations from perfect collinearity (i.e., push the vectors to lie on the same line).
    • Combine this regularizer with the original alignment loss (e.g., KL‑divergence against a safe teacher model).
  5. Back‑propagate once; the model updates its parameters to satisfy both safety and multilingual consistency simultaneously.

Because the loss only touches the representations of the prompts, it does not require any labeled “safe” responses in the target languages—only the translated prompts themselves.
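As a concrete illustration, the consistency term in step 4 might look like the following PyTorch sketch. The function name, the mean-pooled input shape, and the `1 - cosine` penalty form are assumptions for illustration, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def mlc_loss(prompt_embeddings: torch.Tensor) -> torch.Tensor:
    """Multi-Lingual Consistency regularizer (illustrative sketch).

    prompt_embeddings: (num_languages, hidden_dim) pooled embeddings of the
    same prompt translated into each language.
    """
    # Normalize rows so pairwise cosine similarity reduces to a dot product.
    z = F.normalize(prompt_embeddings, dim=-1)
    sim = z @ z.T                                  # (L, L) cosine similarities
    num_langs = sim.size(0)
    # Keep only off-diagonal entries (cross-language pairs).
    off_diag = sim[~torch.eye(num_langs, dtype=torch.bool)]
    # Perfect collinearity means every cross-language similarity equals 1;
    # penalize the average shortfall.
    return (1.0 - off_diag).mean()
```

Identical embeddings yield a loss of zero, while orthogonal ones yield a loss of one, so the term directly measures how far the language-specific representations are from lying on the same line.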

Results & Findings

| Setting | Languages Evaluated | Safety Metric (lower is better) | Utility Metric (e.g., perplexity) |
|---|---|---|---|
| Baseline (English‑only alignment) | 10 low‑resource languages | 0.42 | 0.98 |
| + MLC loss (single update) | Same 10 languages | 0.31 (≈26 % improvement) | 0.97 (≈1 % drop) |
| + MLC loss (2‑step fine‑tune) | Same 10 languages | 0.28 (≈33 % improvement) | 0.96 (≈2 % drop) |

  • Cross‑lingual safety: The model produces fewer toxic or disallowed outputs in languages it never saw labeled safety data for.
  • Generalization: On downstream tasks (question answering, summarization) the multilingual consistency regularizer does not degrade performance; in some cases it even yields slight gains due to better semantic alignment.
  • Scalability: Adding a new language only requires translating the prompt set—no extra model training or data collection.

Practical Implications

  • Fast multilingual rollout: Companies can align a flagship English‑trained LLM for safety in dozens of languages with a single additional fine‑tuning pass, dramatically cutting time‑to‑market.
  • Cost‑effective compliance: Regulators in different jurisdictions often demand language‑specific safety guarantees. The MLC approach satisfies many of those requirements without the expense of building full‑scale multilingual safety datasets.
  • Developer tooling: The loss function is framework‑agnostic and can be wrapped as a simple optimizer hook, making it easy to integrate into existing RLHF pipelines (e.g., Hugging Face’s trl and accelerate libraries).
  • Improved user experience: Consistent safe behavior across languages reduces the risk of “safe‑in‑English, unsafe‑in‑Spanish” scenarios that can erode trust in global products.
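To make the "optimizer hook" idea concrete, here is a toy training loop that adds the consistency term to an (elided) alignment loss and drives the translated-prompt embeddings together. The tiny embedding table standing in for an LLM, the mean pooling, and the loss weight of 0.1 are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
encoder = nn.Embedding(100, 16)           # stand-in for an LLM's hidden states
opt = torch.optim.SGD(encoder.parameters(), lr=0.5)

# The same prompt "translated" into three languages (toy token IDs).
variants = [torch.tensor([1, 2, 3]), torch.tensor([4, 5]), torch.tensor([6, 7, 8])]

def mlc_term() -> torch.Tensor:
    # Mean-pool token embeddings per language, then penalize the average
    # shortfall of pairwise cosine similarity from 1.
    z = torch.stack([encoder(v).mean(dim=0) for v in variants])
    z = F.normalize(z, dim=-1)
    sim = z @ z.T
    return (1.0 - sim[~torch.eye(len(variants), dtype=torch.bool)]).mean()

initial = mlc_term().item()
for _ in range(50):
    opt.zero_grad()
    # In practice this would be `alignment_loss + 0.1 * mlc_term()`;
    # the original alignment loss is elided in this toy example.
    loss = 0.1 * mlc_term()
    loss.backward()
    opt.step()
final = mlc_term().item()
```

After a few dozen steps the consistency term shrinks, showing that the regularizer alone pulls the language-specific prompt representations toward collinearity without any response-level labels.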

Limitations & Future Work

  • Reliance on translation quality: If the prompt translations contain errors or cultural mismatches, the consistency loss may propagate those flaws across languages.
  • Safety granularity: The method aligns at the prompt level; it does not directly enforce fine‑grained safety constraints on model outputs in low‑resource languages.
  • Evaluation breadth: Experiments covered a limited set of languages (mostly Indo‑European); extending to typologically diverse languages and scripts (e.g., Arabic, Hindi, Swahili) remains an open test.
  • Future directions:
    • Incorporate language‑aware weighting in the MLC loss to prioritize high‑risk languages.
    • Combine MLC with lightweight response‑level feedback (e.g., crowd‑sourced safety ratings) to further tighten alignment.
    • Explore self‑supervised generation of multilingual safety prompts to reduce dependence on external translators.

Bottom line: By turning multilingual safety alignment into a representation‑level consistency problem, the authors deliver a practical, low‑cost tool that lets developers “align once, benefit multilingually.” This could become a cornerstone technique for any organization looking to ship safe LLMs worldwide.

Authors

  • Yuyan Bu
  • Xiaohao Liu
  • ZhaoXing Ren
  • Yaodong Yang
  • Juntao Dai

Paper Information

  • arXiv ID: 2602.16660v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: February 18, 2026