[Paper] Fabricator or dynamic translator?

Published: April 16, 2026 at 11:45 AM EDT

Source: arXiv - 2604.15165v1

Overview

The paper "Fabricator or dynamic translator?" investigates why large language models (LLMs) sometimes “over-generate” when used for machine translation. Classic neural machine translation (NMT) systems over-generate mainly by producing garbled output (“neuro-babble”). LLMs, by contrast, can add explanations, hallucinate facts, or enrich a translation in ways a human translator might. Understanding and classifying these behaviors is crucial for deploying LLM-based translators in real-world products.

Key Contributions

Taxonomy of LLM over‑generation – defines three distinct phenomena:
1. Self‑explanations – the model adds glosses or context it invents.
2. Risky confabulations – fabricated content that may be factually wrong.
3. Appropriate explanations – useful, human‑like clarifications that aid comprehension.
Detection pipeline – proposes a lightweight, multi‑stage strategy (prompt‑based probing + classifier) to automatically flag each type of over‑generation.
Commercial‑grade evaluation – runs the pipeline on a production‑scale translation service, reporting precision/recall for each class.
Guidelines for mitigation – offers practical rules (prompt engineering, post‑editing filters) to curb harmful hallucinations while preserving helpful explanations.

Methodology

Data collection – the authors sampled 10 k sentence pairs from a live translation API that uses a state‑of‑the‑art LLM. Human annotators labeled each output as “clean translation,” “self‑explanation,” “confabulation,” or “useful explanation.”
Prompt‑based probing – they crafted a set of diagnostic prompts (e.g., “Did you add any information not present in the source?”) that the LLM answers about its own output. The responses feed into a simple binary classifier.
Feature‑rich classifier – combines the probing answers with surface features (length ratio, presence of parenthetical clauses, lexical novelty) and trains a lightweight gradient‑boosted tree model to predict the over‑generation class.
Iterative refinement – false positives are examined, prompting adjustments (e.g., stricter temperature settings, “no‑explain” system messages) and re‑training.
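
The surface features named above (length ratio, parenthetical clauses, lexical novelty) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `surface_features`, the character-level length ratio, and the use of a baseline `reference` translation to estimate lexical novelty are all assumptions, since the summary does not define how these features are computed.

```python
import re


def surface_features(source: str, translation: str, reference: str) -> dict:
    """Compute three illustrative surface features for one translation.

    `reference` is a hypothetical baseline translation in the target
    language, used here only to estimate lexical novelty.
    """
    # Character-level length ratio; max() guards against an empty source.
    length_ratio = len(translation) / max(len(source), 1)

    # True if the output contains a non-empty (...) or [...] span.
    has_parenthetical = bool(
        re.search(r"\([^()]+\)|\[[^\[\]]+\]", translation)
    )

    # Share of output word tokens that never occur in the reference.
    out_tokens = re.findall(r"\w+", translation.lower())
    ref_vocab = set(re.findall(r"\w+", reference.lower()))
    novel = [tok for tok in out_tokens if tok not in ref_vocab]
    lexical_novelty = len(novel) / max(len(out_tokens), 1)

    return {
        "length_ratio": length_ratio,
        "has_parenthetical": has_parenthetical,
        "lexical_novelty": lexical_novelty,
    }


# Made-up example pair: the output adds a parenthetical gloss.
features = surface_features(
    source="Das ist ein Test.",
    translation="This is a test (a trial run).",
    reference="This is a test.",
)
print(features)
```

In a pipeline like the one described, a dictionary of this kind would be combined with the probing answers before being passed to the classifier.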

The pipeline is deliberately low‑overhead so it can run in‑line with the translation service without adding noticeable latency.

Results & Findings

Phenomenon	Precision	Recall
Self‑explanations	0.84	0.71
Risky confabulations	0.78	0.66
Appropriate explanations	0.91	0.79
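
To compare the three classes on a single number, the reported precision and recall can be combined into an F1 score (their harmonic mean). The F1 values below are derived here from the table and are not figures reported in the summary.

```python
# Precision and recall as listed in the table above.
reported = {
    "Self-explanations": (0.84, 0.71),
    "Risky confabulations": (0.78, 0.66),
    "Appropriate explanations": (0.91, 0.79),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


f1_scores = {name: f1(p, r) for name, (p, r) in reported.items()}

for name, score in f1_scores.items():
    print(f"{name}: F1 = {score:.3f}")
```

This works out to about 0.77, 0.715, and 0.85 respectively, so risky confabulations, the class with the highest stakes, is also the hardest one for the pipeline to detect.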

Self‑explanations are the most common (≈22 % of outputs) and are usually benign, but they can inflate translation length and affect downstream UI layout.
Risky confabulations occur in ~5 % of cases; they often involve invented named entities or dates, posing real misinformation risks.
Appropriate explanations appear in ~9 % of outputs and are positively correlated with higher user satisfaction scores in A/B tests.

Applying the mitigation rules (lower temperature, explicit “translate‑only” prompts) cut risky confabulations by 38 % while preserving 85 % of the useful explanations.
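
As a rough worked example of what these figures imply, the sketch below applies them to a batch of 10,000 outputs. It assumes that the 38 % figure is a relative reduction and that the approximate base rates quoted above (~5 % and ~9 %) are shares of all outputs; the summary does not state either point explicitly, and the batch size is arbitrary.

```python
# Approximate base rates quoted in the findings (share of all outputs).
BASE_RATE_CONFABULATION = 0.05
BASE_RATE_APPROPRIATE = 0.09

# Mitigation effects, interpreted as relative changes (an assumption).
RELATIVE_CUT_CONFABULATION = 0.38
SHARE_APPROPRIATE_RETAINED = 0.85

confab_after = BASE_RATE_CONFABULATION * (1 - RELATIVE_CUT_CONFABULATION)
approp_after = BASE_RATE_APPROPRIATE * SHARE_APPROPRIATE_RETAINED

batch = 10_000  # arbitrary batch size for illustration
confab_counts = (round(BASE_RATE_CONFABULATION * batch), round(confab_after * batch))
approp_counts = (round(BASE_RATE_APPROPRIATE * batch), round(approp_after * batch))

print(f"Risky confabulations: {confab_counts[0]} -> {confab_counts[1]}")
print(f"Appropriate explanations: {approp_counts[0]} -> {approp_counts[1]}")
```

Under these assumptions, roughly 500 risky confabulations per 10,000 outputs would fall to about 310, while about 765 of roughly 900 appropriate explanations would survive.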

Practical Implications

Product teams can integrate the detection pipeline to automatically flag or strip harmful hallucinations before presenting translations to end‑users.
Prompt engineers gain concrete patterns (e.g., “Explain only if needed”) that balance fidelity and helpfulness, reducing the need for costly human post‑editing.
UX designers can decide whether to surface appropriate explanations as tooltips or inline notes, turning a potential “bug” into a feature that boosts comprehension for non‑native speakers.
Compliance & safety – the classifier provides an audit trail for regulatory environments where fabricated content is unacceptable (e.g., medical or legal translation).
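
One way a product team might act on the classifier's output is a simple routing table from predicted label to handling action. The sketch below is hypothetical: the label strings follow the annotation scheme described in the methodology, but the action names, the `Prediction` type, and the confidence threshold are illustrative assumptions rather than anything specified in the paper.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # one of the four annotation labels
    confidence: float  # classifier score in [0, 1]


# Hypothetical mapping from label to handling action.
ACTIONS = {
    "clean translation": "show",
    "self-explanation": "strip_addition",
    "confabulation": "block_and_log",
    "useful explanation": "show_as_tooltip",
}


def route(pred: Prediction, min_confidence: float = 0.5) -> str:
    """Pick an action; uncertain or unknown labels go to human review."""
    if pred.confidence < min_confidence:
        return "human_review"
    return ACTIONS.get(pred.label, "human_review")


print(route(Prediction("confabulation", 0.9)))
print(route(Prediction("useful explanation", 0.3)))
```

Sending low-confidence predictions to human review rather than silently stripping content is a conservative choice, and logging each routed decision is what would supply the audit trail mentioned above.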

Limitations & Future Work

The study focuses on a single commercial LLM and a specific language pair; cross‑lingual generalization remains untested.
The probing prompts rely on the model reporting accurately on its own output, and such self-reports can be unreliable for very low-resource languages.
Future research directions include: expanding the taxonomy to cover multimodal inputs, training a dedicated “hallucination‑aware” translation model, and exploring reinforcement‑learning‑based fine‑tuning to suppress risky confabulations while encouraging helpful explanations.

Authors

Lisa Vasileva
Karin Sim

Paper Information

arXiv ID: 2604.15165v1
Categories: cs.CL
Published: April 16, 2026
