[Paper] Fabricator or dynamic translator?

Published: April 16, 2026 at 11:45 AM EDT

Source: arXiv - 2604.15165v1

Overview

The paper “Fabricator or dynamic translator?” investigates why large language models (LLMs) sometimes over‑generate when used for machine translation. Classic neural machine translation (NMT) systems mainly over‑generate in the form of garbled output (“neuro‑babble”); LLMs can instead add explanations, hallucinate facts, or enrich a translation in ways a human translator might. Understanding and classifying these behaviors matters for deploying LLM‑based translators in real‑world products.

Key Contributions

  • Taxonomy of LLM over‑generation – defines three distinct phenomena:
    1. Self‑explanations – the model adds glosses or context it invents.
    2. Risky confabulations – fabricated content that may be factually wrong.
    3. Appropriate explanations – useful, human‑like clarifications that aid comprehension.
  • Detection pipeline – proposes a lightweight, multi‑stage strategy (prompt‑based probing + classifier) to automatically flag each type of over‑generation.
  • Commercial‑grade evaluation – runs the pipeline on a production‑scale translation service, reporting precision/recall for each class.
  • Guidelines for mitigation – offers practical rules (prompt engineering, post‑editing filters) to curb harmful hallucinations while preserving helpful explanations.
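
As a rough illustration, the taxonomy above can be written down as a small label set. This is only a sketch: the names `OverGeneration` and `is_over_generated` are hypothetical, and the extra `CLEAN` member is an assumption borrowed from the “clean translation” annotation label mentioned in the methodology, not part of the three‑way taxonomy itself.

```python
from enum import Enum


class OverGeneration(Enum):
    """Labels for a translated output (hypothetical names, not from the paper)."""

    CLEAN = "clean translation"  # assumed extra label: no over-generation
    SELF_EXPLANATION = "self-explanation"
    RISKY_CONFABULATION = "risky confabulation"
    APPROPRIATE_EXPLANATION = "appropriate explanation"


def is_over_generated(label: OverGeneration) -> bool:
    """True for any of the three over-generation phenomena."""
    return label is not OverGeneration.CLEAN


if __name__ == "__main__":
    for label in OverGeneration:
        print(f"{label.value}: over-generated={is_over_generated(label)}")
```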

Methodology

  1. Data collection – the authors sampled 10 k sentence pairs from a live translation API that uses a state‑of‑the‑art LLM. Human annotators labeled each output as “clean translation,” “self‑explanation,” “confabulation,” or “useful explanation.”
  2. Prompt‑based probing – they crafted a set of diagnostic prompts (e.g., “Did you add any information not present in the source?”) that the LLM answers about its own output. The responses feed into a simple binary classifier.
  3. Feature‑rich classifier – combines the probing answers with surface features (length ratio, presence of parenthetical clauses, lexical novelty) and trains a lightweight gradient‑boosted tree model to predict the over‑generation class.
  4. Iterative refinement – false positives are examined, prompting adjustments (e.g., stricter temperature settings, “no‑explain” system messages) and re‑training.
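
The surface features named in step 3 could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' implementation: the summary does not define “lexical novelty,” so here it is assumed to be the share of output tokens absent from a baseline translation of the same source (for example, from a conventional NMT system). The function name `surface_features` and the `baseline` argument are hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase word tokens; a deliberately simple tokenizer."""
    return re.findall(r"\w+", text.lower())


def surface_features(source: str, output: str, baseline: str) -> dict:
    """Compute three cheap features for one translation.

    `baseline` is an assumed reference translation used only to
    estimate lexical novelty; the summary does not say how the
    authors measured it.
    """
    out_tokens = tokenize(output)
    baseline_vocab = set(tokenize(baseline))
    novel = [t for t in out_tokens if t not in baseline_vocab]
    return {
        # Character-based ratio, since word counts differ across languages.
        "length_ratio": len(output) / max(len(source), 1),
        # Any non-empty (...) or [...] span in the output.
        "has_parenthetical": bool(re.search(r"\([^()]+\)|\[[^\[\]]+\]", output)),
        # Fraction of output tokens not seen in the baseline translation.
        "lexical_novelty": len(novel) / max(len(out_tokens), 1),
    }


if __name__ == "__main__":
    feats = surface_features(
        source="Er wohnt in Köln.",
        output="He lives in Cologne (a city in western Germany).",
        baseline="He lives in Cologne.",
    )
    print(feats)
```

Features like these would then be concatenated with the probing answers and passed to the tree‑based classifier described above.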

The pipeline is deliberately low‑overhead so it can run in‑line with the translation service without adding noticeable latency.
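
The prompt‑based probing stage might look roughly like the sketch below. Only the first diagnostic question comes from the summary; the second question, the prompt layout, and the names `build_probe`, `parse_answer`, and `probe` are assumptions made for illustration. The model call is injected as a plain callable (`ask`) so that no particular LLM API is implied.

```python
import re
from typing import Callable, Optional

PROBE_QUESTIONS = [
    # Quoted in the summary of the paper.
    "Did you add any information not present in the source?",
    # Hypothetical extra question, added only for illustration.
    "Did you include an explanation or note alongside the translation?",
]


def build_probe(source: str, translation: str, question: str) -> str:
    """Assemble one diagnostic prompt (layout is an assumption)."""
    return (
        f"Source text:\n{source}\n\n"
        f"Your translation:\n{translation}\n\n"
        f"{question}\nAnswer with 'yes' or 'no'."
    )


def parse_answer(reply: str) -> Optional[int]:
    """Map a free-text reply to 1 (yes), 0 (no), or None (unparseable)."""
    match = re.match(r"(yes|no)\b", reply.strip().lower())
    if match is None:
        return None
    return 1 if match.group(1) == "yes" else 0


def probe(source: str, translation: str, ask: Callable[[str], str]) -> list:
    """Ask every diagnostic question and return the parsed answers."""
    return [
        parse_answer(ask(build_probe(source, translation, q)))
        for q in PROBE_QUESTIONS
    ]


if __name__ == "__main__":
    # Stub standing in for a real model call.
    def fake_model(prompt: str) -> str:
        return "Yes, I added a gloss." if "not present" in prompt else "No."

    print(probe("Er wohnt in Köln.", "He lives in Cologne (a city).", fake_model))
```

Keeping the answers as small integers makes them easy to feed into a downstream classifier; unparseable replies are left as `None` so they can be handled explicitly rather than silently counted as “no.”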

Results & Findings

Phenomenon                  Precision   Recall
Self‑explanations           0.84        0.71
Risky confabulations        0.78        0.66
Appropriate explanations    0.91        0.79

  • Self‑explanations are the most common (≈22 % of outputs) and are usually benign, but they can inflate translation length and affect downstream UI layout.
  • Risky confabulations occur in ~5 % of cases; they often involve invented named entities or dates, posing real misinformation risks.
  • Appropriate explanations appear in ~9 % of outputs and are positively correlated with higher user satisfaction scores in A/B tests.
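
The results above list precision and recall only. For readers who want a single number per class, the F1 score (the harmonic mean of the two) can be derived from the reported values; the figures this produces are a derivation from the summary, not numbers reported by the authors.

```python
# Precision and recall as reported in the results table.
reported = {
    "Self-explanations": (0.84, 0.71),
    "Risky confabulations": (0.78, 0.66),
    "Appropriate explanations": (0.91, 0.79),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


if __name__ == "__main__":
    for name, (p, r) in reported.items():
        print(f"{name}: F1 = {f1(p, r):.3f}")
```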

Applying the mitigation rules (lower temperature, explicit “translate‑only” prompts) cut risky confabulations by 38 % while preserving 85 % of the useful explanations.
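
To make these percentages concrete, the sketch below applies them to the approximate base rates quoted above (about 5 % risky confabulations, about 9 % appropriate explanations). It assumes the 38 % reduction and 85 % retention are relative to those base rates and treats the rounded rates as exact, which the summary does not state explicitly; the function name `after_mitigation` is hypothetical.

```python
def after_mitigation(
    n_outputs: int,
    confab_rate: float = 0.05,       # ~5 % risky confabulations (approximate)
    useful_rate: float = 0.09,       # ~9 % appropriate explanations (approximate)
    confab_reduction: float = 0.38,  # reported cut in risky confabulations
    useful_retention: float = 0.85,  # reported share of useful explanations kept
) -> dict:
    """Estimate counts before and after mitigation for n_outputs translations."""
    confab_before = n_outputs * confab_rate
    useful_before = n_outputs * useful_rate
    return {
        "confabulations_before": confab_before,
        "confabulations_after": confab_before * (1 - confab_reduction),
        "useful_before": useful_before,
        "useful_after": useful_before * useful_retention,
    }


if __name__ == "__main__":
    for key, value in after_mitigation(10_000).items():
        print(f"{key}: {value:.0f}")
```

Under these assumptions, a batch of 10,000 outputs would go from roughly 500 risky confabulations to roughly 310, while the number of appropriate explanations would drop from roughly 900 to roughly 765.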

Practical Implications

  • Product teams can integrate the detection pipeline to automatically flag or strip harmful hallucinations before presenting translations to end‑users.
  • Prompt engineers gain concrete patterns (e.g., “Explain only if needed”) that balance fidelity and helpfulness, reducing the need for costly human post‑editing.
  • UX designers can decide whether to surface appropriate explanations as tooltips or inline notes, turning a potential “bug” into a feature that boosts comprehension for non‑native speakers.
  • Compliance & safety – the classifier provides an audit trail for regulatory environments where fabricated content is unacceptable (e.g., medical or legal translation).
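
One way a product team might wire the classifier output into the behaviors listed above is a small routing table with an audit log. Everything here is a hypothetical sketch: the action names, the `POLICY` mapping, and the log record layout are illustrative choices, not something the paper prescribes.

```python
from datetime import datetime, timezone

# Hypothetical mapping from predicted class to product behavior.
POLICY = {
    "clean translation": "show",
    "self-explanation": "flag",            # benign but may break UI layout
    "risky confabulation": "strip",        # remove before display
    "appropriate explanation": "tooltip",  # surface as a helpful note
}


def route(label: str, audit_log: list) -> str:
    """Pick an action for a predicted label and record it for auditing.

    Unknown labels fall back to "flag" so they get human attention.
    """
    action = POLICY.get(label, "flag")
    audit_log.append(
        {
            "time": datetime.now(timezone.utc).isoformat(),
            "label": label,
            "action": action,
        }
    )
    return action


if __name__ == "__main__":
    log = []
    for predicted in ["risky confabulation", "appropriate explanation"]:
        print(predicted, "->", route(predicted, log))
    print(len(log), "audit records written")
```

Defaulting unknown labels to a flag rather than silently showing them is a conservative choice that fits the compliance use case; a real deployment would likely also store the source text and model version in each record.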

Limitations & Future Work

  • The study focuses on a single commercial LLM and a specific language pair; cross‑lingual generalization remains untested.
  • The probing prompts rely on the model’s self‑awareness, which can be unreliable for very low‑resource languages.
  • Future research directions include: expanding the taxonomy to cover multimodal inputs, training a dedicated “hallucination‑aware” translation model, and exploring reinforcement‑learning‑based fine‑tuning to suppress risky confabulations while encouraging helpful explanations.

Authors

  • Lisa Vasileva
  • Karin Sim

Paper Information

  • arXiv ID: 2604.15165v1
  • Categories: cs.CL
  • Published: April 16, 2026