[Paper] Structured Document Translation via Format Reinforcement Learning

Published: December 4, 2025 at 01:58 PM EST
4 min read
Source: arXiv - 2512.05100v1

Overview

The paper introduces Format Reinforcement Learning (FormatRL), a new way to translate structured documents such as XML or HTML while preserving their hierarchical layout. By combining a standard fine‑tuned translation model with a reinforcement‑learning (RL) layer that directly optimizes structure‑aware rewards, the authors achieve higher fidelity translations on a real‑world software‑documentation benchmark.

Key Contributions

  • FormatRL framework: integrates Group Relative Policy Optimization (GRPO) on top of a supervised translation model to jointly optimize translation quality and structural correctness.
  • Novel rewards:
    1. TreeSim – a similarity metric that compares the predicted XML/HTML tree to the reference tree, rewarding correct nesting and tag placement (a minimal sketch follows this list).
    2. Node‑chrF – a character‑level F‑score computed per XML node, encouraging accurate translation of the textual content inside each tag.
  • StrucAUC metric: a fine‑grained evaluation that distinguishes minor formatting slips from catastrophic structural failures, providing clearer insight into model behavior.
  • Empirical validation: extensive experiments on the SAP software‑documentation dataset show consistent gains across six evaluation metrics, including both traditional translation scores (BLEU, chrF) and the new structure‑aware scores.
  • Ablation analysis: demonstrates how each reward component contributes to improvements in structural integrity versus linguistic quality.
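
To make the structural reward concrete, here is a minimal Python sketch. It is not the paper's implementation: the paper's TreeSim is a normalized tree edit distance between XML/HTML trees, whereas this proxy compares pre-order tag sequences with an edit-similarity ratio; the function names and the zero score for unparseable output are assumptions made for illustration.

```python
# Sketch of a TreeSim-style structural reward (not the paper's exact metric).
# Assumption: normalized tree edit distance is approximated by an edit similarity
# over the pre-order sequence of element tags.
import difflib
import xml.etree.ElementTree as ET


def tag_sequence(xml_text: str) -> list[str]:
    """Pre-order traversal of element tags; returns [] if the XML does not parse."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []
    return [elem.tag for elem in root.iter()]


def tree_sim(pred_xml: str, ref_xml: str) -> float:
    """Similarity in [0, 1]; 1.0 means identical tag structure, 0.0 means unparseable."""
    pred_tags, ref_tags = tag_sequence(pred_xml), tag_sequence(ref_xml)
    if not pred_tags or not ref_tags:
        return 0.0
    return difflib.SequenceMatcher(None, pred_tags, ref_tags).ratio()


if __name__ == "__main__":
    ref = "<doc><title>Install</title><p>Run the setup.</p></doc>"
    ok = "<doc><title>Installieren</title><p>Setup ausführen.</p></doc>"
    broken = "<doc><p>Installieren<p>Setup ausführen.</p></p></doc>"
    print(tree_sim(ok, ref), tree_sim(broken, ref))  # identical structure scores 1.0, broken nesting scores lower
```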

Methodology

  1. Base model – a standard sequence‑to‑sequence transformer is first fine‑tuned on parallel structured‑document data (source XML ↔ target XML).
  2. Reinforcement layer – the fine‑tuned model becomes the “policy” in an RL loop. Instead of maximizing likelihood alone, the policy is updated with Group Relative Policy Optimization (GRPO), a stable policy‑gradient algorithm that works well with sparse, high‑variance rewards.
  3. Reward design:
    • TreeSim computes the tree edit distance between the predicted and reference XML trees, normalizing it to a similarity score (higher is better).
    • Node‑chrF evaluates the translation quality inside each XML node, then aggregates across the document.
    • The final reward is a weighted sum of TreeSim and Node‑chrF, allowing the system to balance structural fidelity against linguistic accuracy (a sketch of Node‑chrF and the combined reward follows this list).
  4. Training loop – for each source segment in a batch, the model samples a group of candidate translations, scores each with the combined reward, and updates the policy with GRPO using group‑relative advantages. The supervised loss is retained as a regularizer to keep the model grounded.
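
Here is a hedged sketch of a Node‑chrF-style score and the weighted combination. The character n‑gram F‑score below is a simplified stand-in for a full chrF implementation, nodes are paired by document order, and the weight alpha is purely illustrative; none of this is taken from the paper's code.

```python
# Sketch of a Node-chrF-style reward plus the weighted combination (weights and
# node pairing are assumptions). The n-gram F-score is a simplified stand-in for chrF.
from collections import Counter
import xml.etree.ElementTree as ET


def char_ngram_f(hyp: str, ref: str, max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average F-beta over character n-gram orders 1..max_n."""
    scores = []
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(hyp[i:i + n] for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(ref[i:i + n] for i in range(len(ref) - n + 1))
        if not hyp_ngrams or not ref_ngrams:
            continue
        overlap = sum((hyp_ngrams & ref_ngrams).values())
        prec = overlap / sum(hyp_ngrams.values())
        rec = overlap / sum(ref_ngrams.values())
        if prec + rec == 0:
            scores.append(0.0)
        else:
            scores.append((1 + beta ** 2) * prec * rec / (beta ** 2 * prec + rec))
    return sum(scores) / len(scores) if scores else 0.0


def node_chrf(pred_xml: str, ref_xml: str) -> float:
    """Average character n-gram F-score over node texts, paired in document order."""
    try:
        pred_texts = [(e.text or "") for e in ET.fromstring(pred_xml).iter()]
        ref_texts = [(e.text or "") for e in ET.fromstring(ref_xml).iter()]
    except ET.ParseError:
        return 0.0
    pairs = list(zip(pred_texts, ref_texts))
    if not pairs:
        return 0.0
    return sum(char_ngram_f(h, r) for h, r in pairs) / len(pairs)


def combined_reward(tree_sim_score: float, node_chrf_score: float, alpha: float = 0.5) -> float:
    """Weighted sum of the structural and per-node translation rewards (alpha is illustrative)."""
    return alpha * tree_sim_score + (1 - alpha) * node_chrf_score
```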

The approach is deliberately modular: any existing translation model can be “plugged in,” and the reward functions can be swapped or extended for other markup languages (e.g., JSON, Markdown).
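
The training loop in step 4 can be pictured as follows. This is an illustrative sketch, not the authors' code: `policy.sample` and `policy.step` are hypothetical interfaces, GRPO's clipped-ratio objective and KL penalty are omitted, and `tree_sim`, `node_chrf`, and `combined_reward` refer to the sketches above. The part specific to GRPO is the group-relative advantage, i.e. standardizing each reward within its sampled group.

```python
# Illustrative shape of the RL loop (not the paper's code). `policy.sample` and
# `policy.step` are hypothetical interfaces; the clipped-ratio objective and
# KL regularization of GRPO are omitted to keep the sketch short.
import statistics


def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages: each reward standardized within its sampled group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero for identical rewards
    return [(r - mean) / std for r in rewards]


def training_step(policy, batch, group_size: int = 8, alpha: float = 0.5):
    for src_xml, ref_xml in batch:
        # Sample a group of candidate translations for the same source document.
        candidates = [policy.sample(src_xml) for _ in range(group_size)]
        # Score each candidate with the combined structure + translation reward.
        rewards = [
            combined_reward(tree_sim(c, ref_xml), node_chrf(c, ref_xml), alpha)
            for c in candidates
        ]
        advantages = grpo_advantages(rewards)
        # Policy-gradient update weighted by group-relative advantages,
        # with the supervised loss retained as a regularizer.
        policy.step(src_xml, candidates, advantages, supervised_target=ref_xml)


if __name__ == "__main__":
    # Example: advantages for four sampled candidates with combined rewards.
    print(grpo_advantages([0.42, 0.55, 0.61, 0.50]))
```

Standardizing rewards within each sampled group removes the need for a separate value network, which is one reason GRPO is described as coping well with the sparse, high-variance rewards mentioned above.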

Results & Findings

| Metric                     | Baseline (Supervised) | FormatRL | Δ     |
|----------------------------|-----------------------|----------|-------|
| BLEU                       | 38.2                  | 40.5     | +2.3  |
| chrF                       | 57.1                  | 59.8     | +2.7  |
| TreeSim                    | 0.71                  | 0.84     | +0.13 |
| Node‑chrF                  | 0.68                  | 0.81     | +0.13 |
| StrucAUC (minor errors)    | 0.62                  | 0.78     | +0.16 |
| StrucAUC (major failures)  | 0.91                  | 0.97     | +0.06 |

  • Structural gains: TreeSim and StrucAUC improvements indicate that FormatRL produces far fewer broken tag hierarchies and misplaced nodes.
  • Translation quality: BLEU and chrF also rise, showing that the RL fine‑tuning does not sacrifice linguistic fidelity.
  • Ablation: Removing TreeSim from the reward drops structural scores back to baseline levels, while keeping only Node‑chrF improves BLEU but leaves many tag errors untouched. This confirms the necessity of both rewards.

Overall, the model delivers translations that are both readable and well‑formed—a crucial combination for downstream applications that consume structured data.

Practical Implications

  • Software documentation pipelines: Companies can automate the localization of API docs, user manuals, or help‑center articles without manual post‑processing to fix broken XML/HTML (a minimal structural QA check is sketched after this list).
  • Content management systems (CMS): FormatRL can be integrated as a plug‑in to translate web pages while preserving layout, reducing QA effort for multilingual sites.
  • Data‑driven UI generation: Front‑end frameworks that render UI from markup (e.g., React JSX, Vue templates) can safely consume translated components, avoiding runtime rendering errors caused by malformed tags.
  • Regulatory compliance: In domains where document structure encodes legal semantics (e.g., contracts in XML), preserving hierarchy is mandatory; FormatRL offers a path to trustworthy machine translation.
  • Developer tooling: The reward functions (TreeSim, Node‑chrF) are open‑source and can be reused to evaluate any translation system that works with markup, providing a more meaningful benchmark than BLEU alone.
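
For pipelines that only need a go/no-go signal rather than a training reward, a much coarser check can already catch the catastrophic failures that StrucAUC is designed to separate out. The sketch below is illustrative and not from the paper: it flags translations whose markup no longer parses or whose tag inventory drifts from the source.

```python
# Minimal structural QA check for a localization pipeline (illustrative, not from
# the paper): flag translated documents whose markup no longer parses or whose
# tag counts differ from the source document.
from collections import Counter
import xml.etree.ElementTree as ET


def structural_issues(source_xml: str, translated_xml: str) -> list[str]:
    try:
        src_tags = Counter(e.tag for e in ET.fromstring(source_xml).iter())
    except ET.ParseError:
        return ["source does not parse"]
    try:
        out_tags = Counter(e.tag for e in ET.fromstring(translated_xml).iter())
    except ET.ParseError:
        return ["translation does not parse"]
    issues = []
    for tag in src_tags.keys() | out_tags.keys():
        if src_tags[tag] != out_tags[tag]:
            issues.append(f"tag <{tag}>: {src_tags[tag]} in source vs {out_tags[tag]} in translation")
    return issues
```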

Limitations & Future Work

  • Domain specificity: Experiments focus on SAP software documentation; performance on other markup‑heavy domains (e.g., scientific articles, legal contracts) remains untested.
  • Scalability of RL: Reinforcement learning adds computational overhead, especially when sampling many candidate translations per batch. Optimizing the trade‑off between sample size and training time is an open challenge.
  • Reward engineering: The current weighted sum of TreeSim and Node‑chrF works well, but finding optimal weights may require domain‑specific tuning. Future work could explore adaptive weighting or multi‑objective RL.
  • Extension to multimodal documents: Handling embedded media (images, tables) and cross‑references within the markup is not covered; integrating visual or tabular consistency checks is a promising direction.

By addressing these points, the community can move toward truly universal, structure‑aware machine translation that works across the full spectrum of modern, markup‑rich content.

Authors

  • Haiyue Song
  • Johannes Eschbach-Dymanus
  • Hour Kaing
  • Sumire Honda
  • Hideki Tanaka
  • Bianka Buschbeck
  • Masao Utiyama

Paper Information

  • arXiv ID: 2512.05100v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: December 4, 2025
