[Paper] Fluent Alignment with Disfluent Judges: Post-training for Lower-resource Languages

Published: December 9, 2025 at 11:31 AM EST
3 min read

Source: arXiv - 2512.08777v1

Overview

The paper introduces a lightweight post‑training technique that lets language models for lower‑resource languages stay fluent even when they are aligned using disfluent reward models. By sidestepping the need for costly native‑speaker instruction data, the authors demonstrate a practical way to improve Norwegian Bokmål models—and, by extension, other under‑represented languages—while preserving natural‑sounding output.

Key Contributions

  • Fluent‑first post‑training: A novel on‑policy alignment method that keeps fluency intact even when the reward signal comes from a disfluent judge.
  • Zero‑instruction data requirement: The approach works without any human‑written instruction‑tuning data in the target language.
  • Empirical comparison: Benchmarks against two strong baselines—supervised fine‑tuning on machine‑translated data and multilingual fine‑tuning—show the on‑policy method consistently wins on fluency.
  • Human‑centric evaluation: Native‑speaker fluency judgments validate that the model’s output feels natural, not just statistically fluent.
  • Resource‑efficient pipeline: The method leverages existing multilingual LMs and a modest amount of synthetic data, making it viable for languages lacking large corpora.

Methodology

  1. Base Model – Start from a multilingual language model (e.g., mT5 or LLaMA‑based) that already knows the target language to some degree.
  2. Reward Model (RM) – Train a preference RM on English‑centric data in which the “good” responses are often disfluent when carried over verbatim to the target language (e.g., literal translations); see the sketch after this list.
  3. On‑policy Post‑training – Rather than fine‑tuning directly on the RM’s preferred (often disfluent) texts, the authors generate candidate responses in the target language with the model itself, score them with the RM, and reinforce the higher‑scoring candidates using a policy‑gradient‑style update. Because every candidate comes from the model’s own fluent distribution, this loop improves alignment without importing the judge’s disfluencies.
  4. Baselines for comparison
    • Supervised FT: Fine‑tune on machine‑translated instruction data.
    • Multilingual FT: Jointly fine‑tune on many languages with the same data.
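
This summary does not spell out the RM training objective; a common choice for preference reward models is a pairwise Bradley‑Terry loss, sketched below under that assumption. The names `reward_model`, `chosen_ids`, and `rejected_ids` are hypothetical placeholders, not identifiers from the paper.

```python
# Hypothetical sketch of pairwise (Bradley-Terry) preference training for a
# reward model; the paper's exact RM recipe may differ.
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Push the scalar reward of the preferred response above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```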

The key twist is the on‑policy step: the model learns from its own generations rather than from a static, possibly noisy dataset.
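
As a rough illustration, the sketch below assumes a REINFORCE‑style update with a per‑prompt reward baseline over k sampled candidates; the paper’s actual algorithm and hyperparameters may differ. `policy`, `tokenizer`, `reward_model.score`, and `sequence_log_prob` are hypothetical stand‑ins.

```python
# Hypothetical sketch of the on-policy loop: sample from the current policy,
# score with the (possibly disfluent) reward model, and reinforce the
# better-scoring candidates. Not the paper's exact algorithm.
import torch

def on_policy_step(policy, reward_model, tokenizer, prompt, optimizer, k=4):
    inputs = tokenizer(prompt, return_tensors="pt")
    # 1. Sample k candidates from the current policy (on-policy generations).
    out = policy.generate(
        **inputs, do_sample=True, num_return_sequences=k, max_new_tokens=256
    )
    candidates = [tokenizer.decode(seq, skip_special_tokens=True) for seq in out]
    # 2. Score each candidate with the reward model (hypothetical .score API).
    rewards = torch.tensor([reward_model.score(prompt, c) for c in candidates])
    # 3. Advantage = reward minus the per-prompt mean: candidates that beat
    #    their siblings get reinforced, the rest are down-weighted.
    advantages = rewards - rewards.mean()
    # 4. REINFORCE-style loss over the policy's own samples; sequence_log_prob
    #    is a hypothetical helper returning the summed token log-probability
    #    of a generated sequence under the current policy.
    log_probs = torch.stack([sequence_log_prob(policy, inputs, seq) for seq in out])
    loss = -(advantages.detach() * log_probs).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the gradient only ever flows through text the model itself produced, the update cannot drag the policy toward the judge’s disfluent phrasing.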

Results & Findings

| Approach | Fluency (native‑speaker rating) | Preference Alignment Score |
| --- | --- | --- |
| Supervised FT (MT) | ★★☆☆☆ | Moderate |
| Multilingual FT | ★★☆☆☆ | Moderate |
| On‑policy Post‑training (proposed) | ★★★★☆ | High |

  • The proposed method outperforms both baselines by a large margin on fluency, as judged by native Norwegian speakers.
  • Preference alignment (i.e., satisfying the reward model) remains strong, showing the model does not sacrifice the intended behavior to gain fluency.
  • Ablation experiments confirm that removing the on‑policy loop collapses fluency back to baseline levels, underscoring its necessity.

Practical Implications

  • Rapid localization: Companies can adapt existing multilingual LMs for new markets (e.g., Scandinavian, African, or South‑Asian languages) without waiting for massive native‑speaker datasets.
  • Cost‑effective AI: Eliminates the need for expensive human annotation pipelines; synthetic data plus on‑policy learning is enough to get a usable, fluent assistant.
  • Better user experience: Chatbots, summarizers, or code assistants that sound natural in the user’s language increase adoption and trust.
  • Open‑source community boost: The technique can be packaged as a plug‑and‑play post‑training script, enabling hobbyists and smaller firms to improve language coverage.
  • Compliance & safety: Maintaining fluency while aligning to reward models helps avoid unintentionally “broken” or stilted outputs that could be misinterpreted as low‑quality or biased.

Limitations & Future Work

  • Language scope: The study focuses on Norwegian Bokmål; results may differ for languages with drastically different morphology or script (e.g., Arabic, Hindi).
  • Reward model quality: The approach still depends on a reward model that may encode English‑centric biases; improving multilingual RMs is an open challenge.
  • Scalability of human evaluation: Native‑speaker fluency assessments are costly; automated fluency proxies need further validation.
  • Future directions suggested by the authors include extending the pipeline to truly low‑resource languages with minimal pre‑training data, experimenting with larger base models, and integrating multilingual reward models that understand cultural nuances.

Authors

  • David Samuel
  • Lilja Øvrelid
  • Erik Velldal
  • Andrey Kutuzov

Paper Information

  • arXiv ID: 2512.08777v1
  • Categories: cs.CL, cs.AI
  • Published: December 9, 2025