[Paper] Scaling Open-Ended Reasoning to Predict the Future

Published: December 31, 2025 at 01:59 PM EST
4 min read
Source: arXiv - 2512.25070v1

Overview

This paper tackles a surprisingly practical problem: can large language models (LLMs) be trained to make reliable, open‑ended forecasts about future events? By turning daily news stories into thousands of forecasting questions and training a specialized model (OpenForecaster 8B), the authors demonstrate that a modest‑size LLM can rival far larger proprietary systems on real‑world prediction tasks. The work bridges the gap between academic forecasting research and the tooling that developers need for high‑stakes decision‑making.

Key Contributions

  • OpenForesight dataset – a fully automated pipeline that converts global news articles into diverse, open‑ended forecasting questions, yielding a high‑quality training set without manual labeling (a hypothetical record sketch follows this list).
  • OpenForecaster 8B – an 8‑billion‑parameter LLM fine‑tuned on OpenForesight, equipped with retrieval‑augmented reasoning and reinforcement‑learning (RL) reward shaping for better prediction quality.
  • Leak‑proof evaluation protocol – uses an offline news corpus for both training data generation and retrieval at inference time, guaranteeing that no future information contaminates the model.
  • Empirical results – the 8B model matches or exceeds the accuracy, calibration, and consistency of much larger commercial forecasters on held‑out predictions from May–August 2025.
  • Open‑source release – code, model checkpoints, and the OpenForesight dataset are publicly available, lowering the barrier for research and product development in AI‑driven forecasting.
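
To make the OpenForesight contribution concrete, here is a minimal sketch of what a single record in such a dataset could look like. The field names and values are illustrative assumptions, not the released schema.

```python
# Hypothetical example of one forecasting record; field names are
# illustrative assumptions, not the released OpenForesight schema.
example_record = {
    "question": "Will Country X adopt policy Y by Q3 2025?",
    "created_from": "news article published 2025-01-12",   # source snapshot
    "resolution_date": "2025-09-30",                        # when the answer becomes known
    "answer": "No",                                         # extracted from later coverage
    "answer_evidence": "follow-up article published 2025-10-02",
}

# A model trained on such records is prompted with the question (plus any
# retrieved context) and asked for a reasoned answer and a probability.
```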

Methodology

  1. Data Generation

    • Scrape a large, static archive of daily news articles (up to a cut‑off date).
    • Apply a rule‑based template to each article to produce a forecasting question (e.g., “Will Country X adopt policy Y by Q3 2025?”) and a ground‑truth answer extracted from later articles.
    • Filter for relevance, diversity, and answerability using lightweight heuristics and a small human‑validated validation set.
  2. Model Architecture

    • Start from the Qwen‑3 “thinking” family (decoder‑only transformer).
    • Augment with a retrieval module that fetches the most relevant past news snippets at inference time, providing context that the model can attend to (a retrieval sketch follows this list).
  3. Training Regimen

    • Supervised fine‑tuning on the OpenForesight question‑answer pairs.
    • Reinforcement learning (RL) in which the reward scores predictions on accuracy, calibration (how well stated probabilities reflect observed frequencies), and consistency (coherence across related questions).
    • Use a small held‑out validation set to tune the weighting of these reward components.
  4. Evaluation

    • Conduct a future‑held‑out test: generate forecasts for events that actually occurred between May and August 2025, a period unseen during training.
    • Compare against baseline LLMs (including GPT‑4‑style models) on metrics such as Brier score (calibration), exact‑match accuracy, and pairwise consistency.
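
The retrieval‑augmented reasoning in step 2 can be pictured as a simple retrieve‑then‑prompt loop over the offline news corpus. The Python sketch below assumes a generic `retriever.search` interface and a hypothetical `model.generate` call; it is not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    date: str
    text: str

def build_prompt(question: str, snippets: list[Snippet]) -> str:
    """Assemble a forecasting prompt from retrieved past-news snippets."""
    context = "\n".join(f"[{s.date}] {s.text}" for s in snippets)
    return (
        "Use only the evidence below (all published before the question's cutoff).\n"
        f"Evidence:\n{context}\n\n"
        f"Question: {question}\n"
        "Give your reasoning, then a final answer with a probability."
    )

def forecast(question: str, retriever, model, k: int = 8) -> str:
    # Retrieve the k most relevant snippets from the *offline* corpus only,
    # which is what keeps the evaluation leak-proof.
    snippets = retriever.search(question, top_k=k)  # hypothetical interface
    return model.generate(build_prompt(question, snippets))
```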

Results & Findings

| Metric | OpenForecaster 8B | Larger Proprietary Model* |
|---|---|---|
| Accuracy (exact match) | 68.2 % | 69.0 % |
| Brier score (lower is better) | 0.112 | 0.119 |
| Consistency (pairwise) | 0.84 | 0.81 |
| Calibration error | 0.03 | 0.05 |

* Proprietary baselines include a 70‑billion‑parameter model fine‑tuned on similar forecasting data.
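
To make the Brier score and calibration error rows concrete, here is how these probability metrics are conventionally computed. This is a sketch using the standard Brier score and a simple binned expected calibration error; the paper's exact definitions and binning may differ.

```python
import numpy as np

def brier_score(probs: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    return float(np.mean((probs - outcomes) ** 2))

def calibration_error(probs: np.ndarray, outcomes: np.ndarray, bins: int = 10) -> float:
    """Binned expected calibration error: |mean confidence - empirical frequency| per bin."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    err, n = 0.0, len(probs)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (probs >= lo) & (probs <= hi) if hi == 1.0 else (probs >= lo) & (probs < hi)
        if mask.any():
            err += mask.sum() / n * abs(probs[mask].mean() - outcomes[mask].mean())
    return float(err)

# Example: 70% forecasts that come true 7 times out of 10 are perfectly calibrated
probs = np.full(10, 0.7)
outcomes = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
print(brier_score(probs, outcomes), calibration_error(probs, outcomes))  # 0.21 0.0
```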

Key takeaways

  • Retrieval improves both accuracy (+3 percentage points) and calibration (a 0.02 reduction in Brier score).
  • The RL reward that explicitly penalizes mis‑calibration yields models that are not just right more often, but also trustworthy when they express uncertainty (a sketch of such a reward follows this list).
  • Calibration gains transfer to unrelated benchmarks (e.g., probability‑forecasting tasks in the MMLU suite), suggesting that the training signal is broadly beneficial.
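
One way to picture the reward shaping described in the methodology is a weighted combination of accuracy, calibration, and consistency terms. The scoring functions and weights below are assumptions for illustration, not the paper's reward.

```python
def shaped_reward(
    correct: bool,
    stated_prob: float,
    consistency_score: float,   # e.g., agreement with answers to related questions, in [0, 1]
    w_acc: float = 1.0,
    w_cal: float = 0.5,
    w_con: float = 0.25,
) -> float:
    """Hypothetical composite reward; weights would be tuned on a held-out validation set."""
    outcome = 1.0 if correct else 0.0
    accuracy_term = outcome
    # Penalize mis-calibration with a per-question Brier-style term.
    calibration_term = -((stated_prob - outcome) ** 2)
    return w_acc * accuracy_term + w_cal * calibration_term + w_con * consistency_score
```

The held‑out validation set mentioned in the methodology would then be used to tune the weights w_acc, w_cal, and w_con.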

Practical Implications

  • Decision‑support tools – Companies can embed OpenForecaster 8B into dashboards that surface probabilistic forecasts for market trends, regulatory changes, or supply‑chain disruptions, enabling risk‑aware planning.
  • Cost‑effective forecasting – An 8B model runs comfortably on a single GPU, offering performance comparable to far larger proprietary services and dramatically lowering inference costs for startups and research labs (see the serving sketch after this list).
  • Retrieval‑augmented pipelines – The paper’s retrieval‑plus‑LLM pattern can be repurposed for any domain where up‑to‑date textual evidence (e.g., financial filings, scientific preprints) should inform predictions.
  • Improved AI safety – Better calibrated models reduce over‑confidence, a known failure mode in high‑stakes AI applications such as autonomous systems or policy advising.
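
To give a rough sense of the deployment argument, an 8‑billion‑parameter checkpoint can be served with the standard Hugging Face transformers API on a single modern GPU. The repository id below is a placeholder, not the actual released checkpoint name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id; substitute the actual released OpenForecaster checkpoint.
MODEL_ID = "org-name/openforecaster-8b"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # roughly 16 GB of weights at bf16 for an 8B model
    device_map="auto",
)

prompt = (
    "Question: Will Country X adopt policy Y by Q3 2025?\n"
    "Answer with reasoning and a probability."
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```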

Limitations & Future Work

  • Scope of questions – The automated pipeline focuses on events that can be verified in news archives; niche or long‑tail domains (e.g., specialized scientific breakthroughs) remain under‑represented.
  • Temporal granularity – Forecasts are limited to coarse time windows (months/quarters). Finer‑grained predictions (days or hours) would require richer temporal modeling.
  • Retrieval latency – While retrieval boosts performance, it adds an extra lookup step that may be a bottleneck in latency‑critical settings.
  • Future directions suggested by the authors include: expanding the dataset to multilingual news sources, integrating structured data (e.g., economic indicators) alongside text, and exploring self‑supervised pre‑training objectives that directly target probabilistic reasoning.

All code, model checkpoints, and the OpenForesight dataset are released under an open‑source license, inviting the community to build on these results and bring AI‑powered forecasting into everyday developer workflows.

Authors

  • Nikhil Chandak
  • Shashwat Goel
  • Ameya Prabhu
  • Moritz Hardt
  • Jonas Geiping

Paper Information

  • arXiv ID: 2512.25070v1
  • Categories: cs.LG, cs.CL
  • Published: December 31, 2025
  • PDF: Download PDF
