[Paper] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

Published: December 9, 2025 at 11:31 AM EST
3 min read

Source: arXiv - 2512.08777v1

Overview

The paper introduces a lightweight post‑training technique that lets language models for lower‑resource languages stay fluent even when they are aligned using disfluent reward models. By sidestepping the need for costly native‑speaker instruction data, the authors demonstrate a practical way to improve Norwegian Bokmål models—and, by extension, other under‑represented languages—while preserving natural‑sounding output.

Key Contributions

  • Fluent‑first post‑training: A novel on‑policy alignment method that keeps fluency intact even when the reward signal comes from a disfluent judge.
  • Zero‑instruction data requirement: The approach works without any human‑written instruction‑tuning data in the target language.
  • Empirical comparison: Benchmarks against two strong baselines—supervised fine‑tuning on machine‑translated data and multilingual fine‑tuning—show the on‑policy method consistently wins on fluency.
  • Human‑centric evaluation: Native‑speaker fluency judgments validate that the model’s output feels natural, not just statistically fluent.
  • Resource‑efficient pipeline: The method leverages existing multilingual LMs and a modest amount of synthetic data, making it viable for languages lacking large corpora.

Methodology

  1. Base Model – Start from a multilingual language model (e.g., mT5 or LLaMA‑based) that already knows the target language to some degree.
  2. Reward Model (RM) – Train a preference RM on English‑centric data in which the “good” responses are often disfluent when carried over verbatim to the target language (e.g., literal translations); see the sketch after this list.
  3. On‑policy Post‑training – Rather than fine‑tuning directly on the RM’s preferred (often disfluent) texts, the authors generate candidate responses in the target language with the model itself, score them with the RM, and reinforce the higher‑scoring candidates using a policy‑gradient‑style update. Because every candidate comes from the model’s own fluent distribution, this loop improves alignment without importing the judge’s disfluencies.
  4. Baselines for comparison
    • Supervised FT: Fine‑tune on machine‑translated instruction data.
    • Multilingual FT: Jointly fine‑tune on many languages with the same data.
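
This summary does not spell out the RM training objective; a common choice for preference reward models is a pairwise Bradley‑Terry loss, sketched below under that assumption. The names `reward_model`, `chosen_ids`, and `rejected_ids` are hypothetical placeholders, not identifiers from the paper.

```python
# Hypothetical sketch of pairwise (Bradley-Terry) preference training for a
# reward model; the paper's exact RM recipe may differ.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Push the scalar reward of the preferred response above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```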

The key twist is the on‑policy step: the model learns from its own generations rather than from a static, possibly noisy dataset.
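
As a rough illustration, the sketch below assumes a REINFORCE‑style update with a per‑prompt reward baseline over k sampled candidates; the paper’s actual algorithm and hyperparameters may differ. `policy`, `tokenizer`, `reward_model.score`, and `sequence_log_prob` are hypothetical stand‑ins.

```python
# Hypothetical sketch of the on-policy loop: sample from the current policy,
# score with the (possibly disfluent) reward model, and reinforce the
# better-scoring candidates. Not the paper's exact algorithm.
import torch

def on_policy_step(policy, reward_model, tokenizer, prompt, optimizer, k=4):
    inputs = tokenizer(prompt, return_tensors="pt")
    # 1. Sample k candidates from the current policy (on-policy generations).
    out = policy.generate(
        **inputs, do_sample=True, num_return_sequences=k, max_new_tokens=256
    )
    candidates = [tokenizer.decode(seq, skip_special_tokens=True) for seq in out]
    # 2. Score each candidate with the reward model (hypothetical .score API).
    rewards = torch.tensor([reward_model.score(prompt, c) for c in candidates])
    # 3. Advantage = reward minus the per-prompt mean: candidates that beat
    #    their siblings get reinforced, the rest are down-weighted.
    advantages = rewards - rewards.mean()
    # 4. REINFORCE-style loss over the policy's own samples; sequence_log_prob
    #    is a hypothetical helper returning the summed token log-probability
    #    of a generated sequence under the current policy.
    log_probs = torch.stack([sequence_log_prob(policy, inputs, seq) for seq in out])
    loss = -(advantages.detach() * log_probs).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the gradient only ever flows through text the model itself produced, the update cannot drag the policy toward the judge’s disfluent phrasing.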

Results & Findings

| Approach | Fluency (native‑speaker rating) | Preference Alignment Score |
| --- | --- | --- |
| Supervised FT (MT) | ★★☆☆☆ | Moderate |
| Multilingual FT | ★★☆☆☆ | Moderate |
| On‑policy Post‑training (proposed) | ★★★★☆ | High |

  • The proposed method outperforms both baselines by a large margin on fluency, as judged by native Norwegian speakers.
  • Preference alignment (i.e., satisfying the reward model) remains strong, showing the model does not sacrifice the intended behavior to gain fluency.
  • Ablation experiments confirm that removing the on‑policy loop collapses fluency back to baseline levels, underscoring its necessity.

Practical Implications

  • Rapid localization: Companies can adapt existing multilingual LMs for new markets (e.g., Scandinavian, African, or South‑Asian languages) without waiting for massive native‑speaker datasets.
  • Cost‑effective AI: Eliminates the need for expensive human annotation pipelines; synthetic data plus on‑policy learning is enough to get a usable, fluent assistant.
  • Better user experience: Chatbots, summarizers, or code assistants that sound natural in the user’s language increase adoption and trust.
  • Open‑source community boost: The technique can be packaged as a plug‑and‑play post‑training script, enabling hobbyists and smaller firms to improve language coverage.
  • Compliance & safety: Maintaining fluency while aligning to reward models helps avoid unintentionally “broken” or stilted outputs that could be misinterpreted as low‑quality or biased.

Limitations & Future Work

  • Language scope: The study focuses on Norwegian Bokmål; results may differ for languages with drastically different morphology or script (e.g., Arabic, Hindi).
  • Reward model quality: The approach still depends on a reward model that may encode English‑centric biases; improving multilingual RMs is an open challenge.
  • Scalability of human evaluation: Native‑speaker fluency assessments are costly; automated fluency proxies need further validation.
  • Future directions suggested by the authors include extending the pipeline to truly low‑resource languages with minimal pre‑training data, experimenting with larger base models, and integrating multilingual reward models that understand cultural nuances.

Authors

  • David Samuel
  • Lilja Øvrelid
  • Erik Velldal
  • Andrey Kutuzov

Paper Information

  • arXiv ID: 2512.08777v1
  • Categories: cs.CL, cs.AI
  • Published: December 9, 2025