[Paper] Overview of the TREC 2025 RAGTIME Track
Source: arXiv - 2602.10024v1
Overview
The RAGTIME track at TREC 2025 investigates how well modern language models can generate concise news reports from multilingual source material. By assembling a multilingual news corpus (Arabic, Chinese, English, Russian) and defining three concrete tasks, the track provides the first large‑scale benchmark for cross‑lingual report generation and multilingual information retrieval (MLIR). The results give developers a clear picture of current capabilities and gaps in building truly multilingual newsroom‑automation pipelines.
Key Contributions
- Multilingual Corpus: Curated a balanced set of news stories in four languages, complete with human‑written reference reports.
- Three Benchmark Tasks:
  - Multilingual Report Generation (MRG) – generate a report in any language from a mixed‑language source set.
  - English Report Generation (ERG) – generate an English summary from multilingual sources.
  - Multilingual Information Retrieval (MLIR) – retrieve the most relevant source documents for a given query across languages.
- Comprehensive Evaluation Suite: Combines automatic metrics (BLEU, ROUGE, METEOR, chrF, nDCG) with human assessments of factuality, fluency, and cross‑lingual coherence.
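As a concrete illustration of the retrieval metric in the suite, the standard nDCG@k formula can be sketched in a few lines. This is a minimal textbook implementation, not the track's official scoring code, and the toy relevance grades are made up for the example:

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: each graded relevance is discounted
    by the log of its 1-based rank position."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_rels, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal
    (descending-relevance) ranking, both truncated at depth k."""
    ideal_dcg = dcg(sorted(ranked_rels, reverse=True)[:k])
    return dcg(ranked_rels[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

# A toy system ranking with graded judgments (3 = highly relevant, 0 = not).
score = ndcg_at_k([3, 0, 2, 1, 0])
```

Because the metric normalizes by the ideal ranking, a perfect ordering always scores 1.0 regardless of how many relevant documents exist for the query.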
- Baseline & Leaderboard: Provides strong baselines (e.g., mT5, XLM‑R, multilingual Pegasus) and a public leaderboard with 125 runs from 13 teams.
- Analysis of Failure Modes: Identifies common errors such as language‑mixing in outputs, hallucinated facts, and retrieval bias toward high‑resource languages.
Methodology
- Data Collection – Newswire articles were harvested from reputable outlets in Arabic, Chinese, English, and Russian (≈ 200 k documents). Human annotators wrote one‑paragraph reports in each language, yielding a gold‑standard reference set.
- Task Definition
  - MRG: Input = a set of documents in any combination of the four languages; Output = a report in the language of the query (or any language for the multilingual variant).
  - ERG: Same input, but the output must be English.
  - MLIR: Input = a multilingual query; Output = ranked list of source documents regardless of language.
- Systems – Participants built pipelines that typically combined:
  - Multilingual Retrieval (dense vectors from multilingual BERT/XLM‑R, with a BM25 fallback).
  - Cross‑lingual Fusion (re‑ranking with language‑agnostic relevance models).
  - Generation (encoder‑decoder LMs fine‑tuned on the RAGTIME corpus).
- Evaluation – Automatic scores were computed on the held‑out test set; a subset of runs underwent human evaluation via crowdsourcing platforms, focusing on factual correctness and readability across languages.
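The BM25 fallback in the retrieval stage above can be sketched with a minimal Okapi BM25 scorer. The tokenization (whitespace split) and the three-document corpus are purely illustrative, and the k1/b defaults are the commonly cited ones, not values reported by the track:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each term across the corpus.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            # Term-frequency saturation (k1) and length normalization (b).
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

docs = [
    "ceasefire talks resume in the region".split(),
    "markets rally after central bank decision".split(),
    "negotiators report progress in ceasefire talks".split(),
]
scores = bm25_scores("ceasefire talks".split(), docs)
```

In a real pipeline this lexical score is typically fused with the dense-retrieval score rather than used alone, since BM25 cannot match a query to documents in another language.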
Results & Findings
| Task | Best Automatic Score | Human Fluency (1‑5) | Notable Observations |
|---|---|---|---|
| MRG | BLEU 23.1 / chrF 56.4 | 4.1 | Systems that performed language identification before generation outperformed end‑to‑end multilingual models. |
| ERG | BLEU 27.8 / chrF 60.2 | 4.3 | English‑only fine‑tuning gave a modest boost; however, hallucinations rose by ~12 % compared to MRG. |
| MLIR | nDCG@10 0.71 | — | Retrieval models biased toward English documents; multilingual dense retrieval reduced this bias by 18 %. |
Overall, the top systems combined language‑aware retrieval with monolingual generation (e.g., retrieve documents, translate them to English, then generate). Purely multilingual generators lagged behind, especially on the lower‑resource languages in the set (Arabic, Russian). Human judges flagged factual drift, rather than fluency, as the primary error.
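The retrieve‑translate‑generate pattern followed by the top runs can be sketched as a three-stage function. `translate_to_english` and `summarize` are hypothetical stand-ins for real MT and generation models (here trivial stubs so the control flow is runnable), and the tiny corpus is invented for the example:

```python
def translate_to_english(doc: dict) -> str:
    # Stand-in: a real system would call an MT model for non-English docs.
    if doc["lang"] == "en":
        return doc["text"]
    return f"[{doc['lang']}->en] {doc['text']}"

def summarize(passages: list[str]) -> str:
    # Stand-in: a real system would run a fine-tuned encoder-decoder LM;
    # here we just keep the first sentence of each passage.
    return " ".join(p.split(".")[0] for p in passages)

def report_pipeline(query: str, retrieve) -> str:
    docs = retrieve(query)                             # 1. multilingual retrieval
    english = [translate_to_english(d) for d in docs]  # 2. normalize to English
    return summarize(english)                          # 3. monolingual generation

corpus = [
    {"lang": "en", "text": "Talks resumed on Monday."},
    {"lang": "ru", "text": "Negotiations have resumed."},
]
report = report_pipeline("ceasefire talks", lambda q: corpus)
```

Separating the stages like this is exactly what made the top runs latency-heavy, which is the trade-off the Limitations section returns to.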
Practical Implications
- Newsrooms & Content Aggregators: The benchmark demonstrates that a retrieve‑translate‑generate pipeline can already produce usable English digests from mixed‑language feeds, enabling faster global coverage.
- Multilingual Search Engines: Insights from the MLIR task help improve cross‑lingual ranking, reducing the English‑centric bias that hurts user experience in non‑English markets.
- LLM Fine‑tuning Strategies: The success of language‑identification pre‑steps suggests that developers should incorporate language tags or language‑specific adapters when building multilingual generation services.
- Compliance & Fact‑Checking: The identified hallucination patterns highlight the need for post‑generation verification modules (e.g., retrieval‑augmented generation) before deploying automated reports in regulated domains.
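One lightweight way to realize the language‑identification pre‑step recommended above is to prepend explicit language tags to the model input, in the style of several multilingual seq2seq models. The tag format below is illustrative, not specified by the track:

```python
def tag_input(text: str, src_lang: str, tgt_lang: str) -> str:
    """Prepend source- and target-language tags so the generator is told
    both what it is reading and what it must emit (hypothetical tag scheme)."""
    return f"<{src_lang}> <2{tgt_lang}> {text}"

# For the ERG task, every input is steered toward English output:
tagged = tag_input("Les pourparlers ont repris.", "fr", "en")
```

The track's finding that language identification before generation beats end‑to‑end multilingual models suggests even cheap conditioning signals like these tags can reduce language‑mixing in outputs.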
Limitations & Future Work
- Domain Narrowness: The corpus is limited to newswire; performance may differ on scientific, legal, or social‑media text.
- Language Coverage: Only four languages were included; extending to low‑resource languages (e.g., Swahili, Hindi) remains an open challenge.
- Evaluation Gaps: Automatic metrics still correlate weakly with human judgments on factuality; richer evaluation frameworks (e.g., factuality‑oriented metrics) are needed.
- Scalability: Current top systems rely on multiple stages (retrieval, translation, generation), which can be latency‑heavy for real‑time applications. Future work aims at end‑to‑end multilingual generation that maintains factual grounding while reducing pipeline complexity.
Authors
- Dawn Lawrie
- Sean MacAvaney
- James Mayfield
- Luca Soldaini
- Eugene Yang
- Andrew Yates
Paper Information
- arXiv ID: 2602.10024v1
- Categories: cs.IR, cs.CL
- Published: February 10, 2026