[Paper] Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting

Published: March 9, 2026 at 01:59 PM EDT
5 min read
Source: arXiv - 2603.08707v1

Overview

The paper “Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting” introduces a new way to evaluate forecasting models that must operate in a constantly changing world. Instead of the usual static train‑test split, the authors set up a live benchmark that scores models continuously on an ever‑updating GitHub activity stream, exposing how well models cope with temporal drift, distribution shifts, and long‑term stability.

Key Contributions

  • Live, rolling‑window benchmark – a continuously refreshed evaluation pipeline that scores forecasts day‑by‑day on a non‑stationary data stream.
  • Open‑source dataset from GitHub – time‑series derived from the top 400 starred repositories (issues, PRs, pushes, new stars), capturing real‑world dynamics like releases, tooling changes, and external events.
  • Standardized protocols & leaderboard – clear rules for data ingestion, model submission, and performance tracking, enabling reproducible, ongoing comparison across research groups and industry teams.
  • Empirical analysis of foundation‑style models – demonstrates how static benchmarks can over‑estimate performance and highlights the gap between claimed “generalization” and actual temporal robustness.
  • Open‑source tooling – the benchmark code, dashboard, and data pipelines are publicly available, encouraging community contributions and extensions to other domains.

Methodology

  1. Data Collection – The authors continuously pull activity logs (issues opened, pull requests opened, push events, new stargazers) from GitHub’s public API for the 400 most‑starred repositories. Each metric forms a separate univariate time series.
  2. Rolling Evaluation Window – Every day, a new observation is added to each series. Models are asked to forecast a fixed horizon (e.g., next 7 days) using only data available up to the current day. After the forecast horizon passes, the predictions are scored and the window slides forward.
  3. Metrics – Standard forecasting error measures (MAE, RMSE, MAPE) are computed per series and aggregated across all repositories. The benchmark also tracks stability metrics such as variance of error over time.
  4. Submission Protocol – Participants submit a Docker container or a Python script that receives the latest training window and returns forecasts. The benchmark orchestrates execution, logs results, and updates the public leaderboard automatically.
  5. Baseline Models – The paper evaluates several baselines (ARIMA, Prophet, simple exponential smoothing) and a few recent foundation‑style models (e.g., Temporal Fusion Transformers pre‑trained on large corpora) to illustrate the benchmark’s diagnostic power.
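The rolling-window protocol of steps 2–3 and the exponential-smoothing baseline of step 5 can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the forecaster interface, window sizes, and synthetic series are assumptions.

```python
import numpy as np

def ses_forecast(history, horizon, alpha=0.3):
    """Simple exponential smoothing: a flat forecast at the smoothed level."""
    level = history[0]
    for y in history[1:]:
        level = alpha * y + (1 - alpha) * level
    return np.full(horizon, level)

def rolling_evaluate(series, forecaster, horizon=7, min_train=30):
    """Slide a daily cutoff forward; score each forecast only after its
    full horizon has elapsed, then advance the window."""
    window_maes = []
    for cutoff in range(min_train, len(series) - horizon + 1):
        train = series[:cutoff]                   # data available "today"
        actual = series[cutoff:cutoff + horizon]  # revealed only later
        pred = forecaster(train, horizon)
        window_maes.append(np.mean(np.abs(pred - actual)))
    window_maes = np.asarray(window_maes)
    # Average accuracy plus its variance over time -- the stability
    # signal the benchmark tracks alongside point-error metrics.
    return window_maes.mean(), window_maes.var()

rng = np.random.default_rng(0)
series = 50 + np.cumsum(rng.normal(0, 2, size=120))  # synthetic daily counts
mae, stability = rolling_evaluate(series, ses_forecast)
```

The key property mirrored here is that a forecast is scored only once its horizon has fully elapsed, so a model can never peek at the observations it is being graded on.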

Results & Findings

  • Static vs. Live Performance Gap – Models that ranked top on a traditional frozen test set dropped 15‑30 % in accuracy when evaluated live, revealing hidden over‑fitting to the static split.
  • Temporal Drift Sensitivity – Foundation models showed strong short‑term forecasts but struggled during abrupt regime changes (e.g., a major repository release or a sudden surge in contributions due to a security incident).
  • Stability Matters – Models with slightly higher average error but lower variance (e.g., simple exponential smoothing) maintained more reliable performance over time, which is valuable for production monitoring.
  • Benchmark Feasibility – The live pipeline ran with low latency (≈ 5 minutes per daily update) and scaled to hundreds of series, proving that continuous benchmarking is operationally practical.

Practical Implications

  • Better Model Selection for Production – Teams can now prioritize models that demonstrate sustained performance, not just peak accuracy on a static hold‑out set, reducing surprise failures in production.
  • Continuous Monitoring as a Service – The Impermanent framework can be adapted to other streaming domains (e.g., IoT sensor data, financial tick data), offering a plug‑and‑play service for evaluating any forecasting pipeline in real time.
  • Guidance for Foundation Model Vendors – The benchmark highlights the need for training procedures that explicitly account for temporal distribution shift, encouraging the development of pre‑training objectives and fine‑tuning strategies that improve temporal robustness.
  • Developer Tooling – The open‑source dashboard provides instant visual feedback on forecast quality, enabling rapid debugging and iterative improvement of forecasting codebases.

Limitations & Future Work

  • Domain Specificity – The current dataset focuses on GitHub activity, which, while highly dynamic, may not capture all types of temporal non‑stationarity (e.g., seasonality in energy demand).
  • Metric Scope – The benchmark emphasizes point‑forecast errors; extending to probabilistic forecasts and calibration metrics would give a fuller picture of uncertainty handling.
  • Scalability to Massive Streams – While the system handles hundreds of series, scaling to tens of thousands (e.g., all public repositories) will require more efficient data pipelines and distributed evaluation.
  • Model Diversity – Future iterations aim to incorporate multimodal and multivariate models that jointly forecast several metrics, reflecting real‑world scenarios where signals interact.
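As a pointer toward the probabilistic extension mentioned above, the standard pinball (quantile) loss is one metric such a benchmark could adopt. This is a generic sketch of that loss, not part of the released tooling.

```python
import numpy as np

def pinball_loss(actual, pred_quantile, q):
    """Average pinball (quantile) loss of a q-quantile forecast:
    under-prediction is penalized with weight q, over-prediction with 1 - q."""
    diff = actual - pred_quantile
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

actual = np.array([10.0, 12.0, 9.0, 14.0])
median = np.full(4, 11.0)
loss = pinball_loss(actual, median, q=0.5)  # at q=0.5 this is half the MAE
```

Scoring several quantiles per series (e.g. q = 0.1, 0.5, 0.9) would let the leaderboard reward well-calibrated uncertainty, not just accurate point forecasts.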

Impermanent pushes the community toward a more realistic, “always‑on” evaluation mindset—one that aligns better with the challenges developers face when deploying time‑series models in production environments.

Authors

  • Azul Garza
  • Renée Rosillo
  • Rodrigo Mendoza‑Smith
  • David Salinas
  • Andrew Robert Williams
  • Arjun Ashok
  • Mononito Goswami
  • José Martín Juárez

Paper Information

  • arXiv ID: 2603.08707v1
  • Categories: cs.LG
  • Published: March 9, 2026