[Paper] Overview of the TREC 2025 RAGTIME Track

Published: February 10, 2026
Source: arXiv - 2602.10024v1

Overview

The RAGTIME track at TREC 2025 investigates how well modern language models can generate concise news reports from multilingual source material. By assembling a multilingual news corpus (Arabic, Chinese, English, Russian) and defining three concrete tasks, the track provides the first large‑scale benchmark for cross‑lingual report generation and multilingual information retrieval (MLIR). The results give developers a clear picture of current capabilities and gaps in building truly multilingual newsroom‑automation pipelines.

Key Contributions

  • Multilingual Corpus: Curated a balanced set of news stories in four languages, complete with human‑written reference reports.
  • Three Benchmark Tasks:
    1. Multilingual Report Generation (MRG) – generate a report in any language from a mixed‑language source set.
    2. English Report Generation (ERG) – generate an English summary from multilingual sources.
    3. Multilingual Information Retrieval (MLIR) – retrieve the most relevant source documents for a given query across languages.
  • Comprehensive Evaluation Suite: Combines automatic metrics (BLEU, ROUGE, METEOR, chrF, nDCG) with human assessments of factuality, fluency, and cross‑lingual coherence.
  • Baseline & Leaderboard: Provides strong baselines (e.g., mT5, XLM‑R, multilingual Pegasus) and a public leaderboard with 125 runs from 13 teams.
  • Analysis of Failure Modes: Identifies common errors such as language‑mixing in outputs, hallucinated facts, and retrieval bias toward high‑resource languages.
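Of the automatic metrics listed above, nDCG is the one used for the MLIR retrieval task. As a minimal pure-Python sketch (not the track's official scorer, which is unspecified here), nDCG@k can be computed from graded relevance judgments like this:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked results."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """nDCG@k: DCG of the ranking divided by DCG of the ideal ranking."""
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance grades of the documents in ranked order
# (3 = highly relevant, 0 = not relevant); the grades are made up.
print(round(ndcg_at_k([3, 2, 0, 1, 0], k=10), 3))  # → 0.985
```

A perfect ranking scores 1.0; swapping a relevant document below a non-relevant one, as in the example, lowers the score.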

Methodology

  1. Data Collection – Newswire articles were harvested from reputable outlets in Arabic, Chinese, English, and Russian (≈ 200 k documents). Human annotators wrote one‑paragraph reports in each language, yielding a gold‑standard reference set.
  2. Task Definition
    • MRG: Input = a set of documents in any combination of the four languages; Output = a report in the language of the query (or any language for the multilingual variant).
    • ERG: Same input, but the output must be English.
    • MLIR: Input = a multilingual query; Output = ranked list of source documents regardless of language.
  3. Systems – Participants built pipelines that typically combined:
    • Multilingual Retrieval (dense vectors from multilingual BERT/XLM‑R, BM25 fallback).
    • Cross‑lingual Fusion (re‑ranking with language‑agnostic relevance models).
    • Generation (encoder‑decoder LMs fine‑tuned on the RAGTIME corpus).
  4. Evaluation – Automatic scores were computed on the held‑out test set; a subset of runs underwent human evaluation via crowdsourcing platforms, focusing on factual correctness and readability across languages.
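The three pipeline stages above can be sketched with toy stand-in components. Everything below is hypothetical scaffolding for illustration only: real participant systems used dense multilingual retrieval, an MT step, and a fine-tuned generator in place of these placeholders.

```python
def retrieve(query, corpus, k=5):
    """Toy lexical retrieval: rank documents by term overlap with the query
    (a stand-in for dense retrieval with a BM25 fallback)."""
    q_terms = set(query.lower().split())
    scored = sorted(corpus, key=lambda d: -len(q_terms & set(d.lower().split())))
    return scored[:k]

def translate_to_english(doc):
    """Placeholder MT step; a real system would call a translation model."""
    return doc  # assume the toy corpus is already English

def generate_report(docs):
    """Placeholder generation step: keep the lead sentence of each document."""
    return " ".join(d.split(".")[0] + "." for d in docs)

corpus = [
    "Floods hit the capital on Monday. Thousands were evacuated.",
    "The stock market rallied. Tech shares led the gains.",
    "Rescue teams reached flooded districts. Aid is arriving.",
]
top_docs = retrieve("capital floods rescue", corpus, k=2)
report = generate_report([translate_to_english(d) for d in top_docs])
print(report)
```

The point of the sketch is the staging, not the components: each stage can be swapped out independently, which is exactly what made retrieve–translate–generate pipelines attractive to participants.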

Results & Findings

  • MRG – BLEU 23.1 / chrF 56.4; human fluency 4.1/5. Systems that performed language identification before generation outperformed end‑to‑end multilingual models.
  • ERG – BLEU 27.8 / chrF 60.2; human fluency 4.3/5. English‑only fine‑tuning gave a modest boost; however, hallucinations rose by ~12 % compared to MRG.
  • MLIR – nDCG@10 0.71 (no fluency score applies). Retrieval models were biased toward English documents; multilingual dense retrieval reduced this bias by 18 %.

Overall, the top systems leveraged language‑aware retrieval + monolingual generation (e.g., retrieve documents, translate to English, then generate). Purely multilingual generators lagged behind, especially on low‑resource languages (Arabic, Russian). Human judges flagged factual drift as the primary error, not fluency.

Practical Implications

  • Newsrooms & Content Aggregators: The benchmark demonstrates that a retrieve‑translate‑generate pipeline can already produce usable English digests from mixed‑language feeds, enabling faster global coverage.
  • Multilingual Search Engines: Insights from the MLIR task help improve cross‑lingual ranking, reducing the English‑centric bias that hurts user experience in non‑English markets.
  • LLM Fine‑tuning Strategies: The success of language‑identification pre‑steps suggests that developers should incorporate language tags or language‑specific adapters when building multilingual generation services.
  • Compliance & Fact‑Checking: The identified hallucination patterns highlight the need for post‑generation verification modules (e.g., retrieval‑augmented generation) before deploying automated reports in regulated domains.
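The language-tag recommendation above can be illustrated with a short sketch. The tag format here is an assumption (mBART/mT5-style codes such as `<ar>`), not something the track prescribes:

```python
# Map from an identified source language to an illustrative control token.
LANG_TAGS = {"arabic": "<ar>", "chinese": "<zh>", "english": "<en>", "russian": "<ru>"}

def tag_source(text, detected_lang):
    """Prefix the text with its language tag so a downstream generator
    is explicitly conditioned on the source language."""
    return f"{LANG_TAGS[detected_lang]} {text}"

print(tag_source("Rescue teams reached the flooded districts.", "english"))
```

Running language identification first and emitting such tags is the pre-step that, per the results above, helped systems avoid language-mixing in their outputs.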

Limitations & Future Work

  • Domain Narrowness: The corpus is limited to newswire; performance may differ on scientific, legal, or social‑media text.
  • Language Coverage: Only four languages were included; extending to low‑resource languages (e.g., Swahili, Hindi) remains an open challenge.
  • Evaluation Gaps: Automatic metrics still correlate weakly with human judgments on factuality; richer evaluation frameworks (e.g., factuality‑oriented metrics) are needed.
  • Scalability: Current top systems rely on multiple stages (retrieval, translation, generation), which can be latency‑heavy for real‑time applications. Future work aims at end‑to‑end multilingual generation that maintains factual grounding while reducing pipeline complexity.

Authors

  • Dawn Lawrie
  • Sean MacAvaney
  • James Mayfield
  • Luca Soldaini
  • Eugene Yang
  • Andrew Yates

Paper Information

  • arXiv ID: 2602.10024v1
  • Categories: cs.IR, cs.CL
  • Published: February 10, 2026