[Paper] Overview of the TREC 2025 RAGTIME Track
Source: arXiv - 2602.10024v1
Overview
The RAGTIME track at TREC 2025 investigates how well modern language models can generate concise news reports from multilingual source material. By assembling a multilingual news corpus (Arabic, Chinese, English, Russian) and defining three concrete tasks, the track provides the first large‑scale benchmark for cross‑lingual report generation and multilingual information retrieval (MLIR). The results give developers a clear picture of current capabilities and gaps in building truly multilingual newsroom‑automation pipelines.
Key Contributions
- Multilingual Corpus: Curated a balanced set of news stories in four languages, complete with human‑written reference reports.
- Three Benchmark Tasks:
  - Multilingual Report Generation (MRG) – generate a report in any language from a mixed‑language source set.
  - English Report Generation (ERG) – generate an English summary from multilingual sources.
  - Multilingual Information Retrieval (MLIR) – retrieve the most relevant source documents for a given query across languages.
- Comprehensive Evaluation Suite: Combines automatic metrics (BLEU, ROUGE, METEOR, chrF, nDCG) with human assessments of factuality, fluency, and cross‑lingual coherence.
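As a concrete illustration of the retrieval metric in the suite, the standard nDCG@k formula can be sketched in a few lines. This is a minimal textbook implementation, not the track's official scoring code, and the toy relevance grades are made up for the example:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each graded relevance is discounted
    by the log of its 1-based rank position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal
    (descending-relevance) ranking, both truncated at depth k."""
    ideal_dcg = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# A toy system ranking with graded judgments (3 = highly relevant, 0 = not).
score = ndcg_at_k([3, 0, 2, 1, 0])
```

Because the metric normalizes by the ideal ranking, a perfect ordering always scores 1.0 regardless of how many relevant documents exist for the query.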
- Baseline & Leaderboard: Provides strong baselines (e.g., mT5, XLM‑R, multilingual Pegasus) and a public leaderboard with 125 runs from 13 teams.
- Analysis of Failure Modes: Identifies common errors such as language‑mixing in outputs, hallucinated facts, and retrieval bias toward high‑resource languages.
Methodology
- Data Collection – Newswire articles were harvested from reputable outlets in Arabic, Chinese, English, and Russian (≈ 200 k documents). Human annotators wrote one‑paragraph reports in each language, yielding a gold‑standard reference set.
- Task Definition
  - MRG: Input = a set of documents in any combination of the four languages; Output = a report in the language of the query (or any language for the multilingual variant).
  - ERG: Same input, but the output must be English.
  - MLIR: Input = a multilingual query; Output = ranked list of source documents regardless of language.
- Systems – Participants built pipelines that typically combined:
  - Multilingual Retrieval (dense vectors from multilingual BERT/XLM‑R, with a BM25 fallback).
  - Cross‑lingual Fusion (re‑ranking with language‑agnostic relevance models).
  - Generation (encoder‑decoder LMs fine‑tuned on the RAGTIME corpus).
- Evaluation – Automatic scores were computed on the held‑out test set; a subset of runs underwent human evaluation via crowdsourcing platforms, focusing on factual correctness and readability across languages.
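The BM25 fallback in the retrieval stage above can be sketched with a minimal Okapi BM25 scorer. The tokenization (whitespace split) and the three-document corpus are purely illustrative, and the k1/b defaults are the commonly cited ones, not values reported by the track:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each term across the corpus.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "ceasefire talks resume in the region".split(),
    "markets rally after central bank decision".split(),
    "negotiators report progress in ceasefire talks".split(),
]
scores = bm25_scores("ceasefire talks".split(), docs)
```

In a real pipeline this lexical score is typically fused with the dense-retrieval score rather than used alone, since BM25 cannot match a query to documents in another language.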
Results & Findings
| Task | Best Automatic Score | Human Fluency (1‑5) | Notable Observations |
|---|---|---|---|
| MRG | BLEU 23.1 / chrF 56.4 | 4.1 | Systems that performed language identification before generation outperformed end‑to‑end multilingual models. |
| ERG | BLEU 27.8 / chrF 60.2 | 4.3 | English‑only fine‑tuning gave a modest boost; however, hallucinations rose by ~12 % compared to MRG. |
| MLIR | nDCG@10 0.71 | — | Retrieval models biased toward English documents; multilingual dense retrieval reduced this bias by 18 %. |
Overall, the top systems combined language‑aware retrieval with monolingual generation (e.g., retrieve documents, translate them to English, then generate). Purely multilingual generators lagged behind, especially on the lower‑resource languages in the set (Arabic, Russian). Human judges flagged factual drift, rather than fluency, as the primary error.
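The retrieve‑translate‑generate pattern followed by the top runs can be sketched as a three-stage function. `translate_to_english` and `summarize` are hypothetical stand-ins for real MT and generation models (here trivial stubs so the control flow is runnable), and the tiny corpus is invented for the example:

```python
def translate_to_english(doc: dict) -> str:
    # Stand-in: a real system would call an MT model for non-English docs.
    if doc["lang"] == "en":
        return doc["text"]
    return f"[{doc['lang']}->en] {doc['text']}"

def summarize(passages: list[str]) -> str:
    # Stand-in: a real system would run a fine-tuned encoder-decoder LM;
    # here we just keep the first sentence of each passage.
    return " ".join(p.split(".")[0] for p in passages)

def report_pipeline(query: str, retrieve) -> str:
    docs = retrieve(query)                             # 1. multilingual retrieval
    english = [translate_to_english(d) for d in docs]  # 2. normalize to English
    return summarize(english)                          # 3. monolingual generation

corpus = [
    {"lang": "en", "text": "Talks resumed on Monday."},
    {"lang": "ru", "text": "Negotiations have resumed."},
]
report = report_pipeline("ceasefire talks", lambda q: corpus)
```

Separating the stages like this is exactly what made the top runs latency-heavy, which is the trade-off the Limitations section returns to.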
Practical Implications
- Newsrooms & Content Aggregators: The benchmark demonstrates that a retrieve‑translate‑generate pipeline can already produce usable English digests from mixed‑language feeds, enabling faster global coverage.
- Multilingual Search Engines: Insights from the MLIR task help improve cross‑lingual ranking, reducing the English‑centric bias that hurts user experience in non‑English markets.
- LLM Fine‑tuning Strategies: The success of language‑identification pre‑steps suggests that developers should incorporate language tags or language‑specific adapters when building multilingual generation services.
- Compliance & Fact‑Checking: The identified hallucination patterns highlight the need for post‑generation verification modules (e.g., retrieval‑augmented generation) before deploying automated reports in regulated domains.
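One lightweight way to realize the language‑identification pre‑step recommended above is to prepend explicit language tags to the model input, in the style of several multilingual seq2seq models. The tag format below is illustrative, not specified by the track:

```python
def tag_input(text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend source- and target-language tags so the generator is told
    both what it is reading and what it must emit (hypothetical tag scheme)."""
    return f"<{src_lang}> <2{tgt_lang}> {text}"

# For the ERG task, every input is steered toward English output:
tagged = tag_input("Les pourparlers ont repris.", "fr", "en")
```

The track's finding that language identification before generation beats end‑to‑end multilingual models suggests even cheap conditioning signals like these tags can reduce language‑mixing in outputs.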
Limitations & Future Work
- Domain Narrowness: The corpus is limited to newswire; performance may differ on scientific, legal, or social‑media text.
- Language Coverage: Only four languages were included; extending to low‑resource languages (e.g., Swahili, Hindi) remains an open challenge.
- Evaluation Gaps: Automatic metrics still correlate weakly with human judgments on factuality; richer evaluation frameworks (e.g., factuality‑oriented metrics) are needed.
- Scalability: Current top systems rely on multiple stages (retrieval, translation, generation), which can be latency‑heavy for real‑time applications. Future work aims at end‑to‑end multilingual generation that maintains factual grounding while reducing pipeline complexity.
Authors
- Dawn Lawrie
- Sean MacAvaney
- James Mayfield
- Luca Soldaini
- Eugene Yang
- Andrew Yates
Paper Information
- arXiv ID: 2602.10024v1
- Categories: cs.IR, cs.CL
- Published: February 10, 2026