[Paper] Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting
Source: arXiv - 2603.08707v1
Overview
The paper “Impermanent: A Live Benchmark for Temporal Generalization in Time Series Forecasting” introduces a new way to evaluate forecasting models that must operate in a constantly changing world. Instead of the usual static train‑test split, the authors set up a live benchmark that scores models continuously on an ever‑updating GitHub activity stream, exposing how well models cope with temporal drift, distribution shifts, and long‑term stability.
Key Contributions
- Live, rolling‑window benchmark – a continuously refreshed evaluation pipeline that scores forecasts day‑by‑day on a non‑stationary data stream.
- Open‑source dataset from GitHub – time‑series derived from the top 400 starred repositories (issues, PRs, pushes, new stars), capturing real‑world dynamics like releases, tooling changes, and external events.
- Standardized protocols & leaderboard – clear rules for data ingestion, model submission, and performance tracking, enabling reproducible, ongoing comparison across research groups and industry teams.
- Empirical analysis of foundation‑style models – demonstrates how static benchmarks can overestimate performance, highlighting the gap between claimed “generalization” and actual temporal robustness.
- Open‑source tooling – the benchmark code, dashboard, and data pipelines are publicly available, encouraging community contributions and extensions to other domains.
Methodology
- Data Collection – The authors continuously pull activity logs (issues opened, pull requests opened, push events, new stargazers) from GitHub’s public API for the 400 most‑starred repositories. Each metric forms a separate univariate time series.
- Rolling Evaluation Window – Every day, a new observation is added to each series. Models are asked to forecast a fixed horizon (e.g., next 7 days) using only data available up to the current day. After the forecast horizon passes, the predictions are scored and the window slides forward.
- Metrics – Standard forecasting error measures (MAE, RMSE, MAPE) are computed per series and aggregated across all repositories. The benchmark also tracks stability metrics such as variance of error over time.
- Submission Protocol – Participants submit a Docker container or a Python script that receives the latest training window and returns forecasts. The benchmark orchestrates execution, logs results, and updates the public leaderboard automatically.
- Baseline Models – The paper evaluates several baselines (ARIMA, Prophet, simple exponential smoothing) and a few recent foundation‑style models (e.g., Temporal Fusion Transformers pre‑trained on large corpora) to illustrate the benchmark’s diagnostic power.
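The rolling evaluation described above can be sketched in a few lines. This is an illustrative reconstruction, not the benchmark's actual code: the function names (`rolling_evaluate`, `naive_forecast`) and parameters (`horizon`, `min_history`) are assumptions, and the naive last-value baseline stands in for a submitted model.

```python
import math

# Point-forecast error measures used by the benchmark (per series).
def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    # Skip zero actuals to avoid division by zero; average over remaining terms.
    terms = [abs((t - p) / t) for t, p in zip(y_true, y_pred) if t != 0]
    return 100.0 * sum(terms) / len(terms)

def rolling_evaluate(series, model_fn, horizon=7, min_history=30):
    """Slide a daily window over one series: at each step the model sees
    only data up to 'today', and its forecast is scored once the horizon
    has elapsed. Returns mean error and its variance over time (the
    stability metric tracked alongside average accuracy)."""
    errors = []
    for today in range(min_history, len(series) - horizon):
        history = series[:today]                # data available so far
        forecast = model_fn(history, horizon)   # model's multi-day forecast
        actual = series[today:today + horizon]  # revealed after the horizon
        errors.append(mae(actual, forecast))
    mean_err = sum(errors) / len(errors)
    var_err = sum((e - mean_err) ** 2 for e in errors) / len(errors)
    return mean_err, var_err

# Trivial last-value baseline, standing in for a submitted model.
def naive_forecast(history, horizon):
    return [history[-1]] * horizon
```

A model with a low `mean_err` but a high `var_err` is exactly the kind of submission the live leaderboard is designed to flag: accurate on average, unreliable over time.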
Results & Findings
- Static vs. Live Performance Gap – Models that ranked top on a traditional frozen test set dropped 15‑30 % in accuracy when evaluated live, revealing hidden over‑fitting to the static split.
- Temporal Drift Sensitivity – Foundation models showed strong short‑term forecasts but struggled during abrupt regime changes (e.g., a major repository release or a sudden surge in contributions due to a security incident).
- Stability Matters – Models with slightly higher average error but lower variance (e.g., simple exponential smoothing) maintained more reliable performance over time, which is valuable for production monitoring.
- Benchmark Feasibility – The live pipeline ran with low latency (≈ 5 minutes per daily update) and scaled to hundreds of series, proving that continuous benchmarking is operationally practical.
Practical Implications
- Better Model Selection for Production – Teams can now prioritize models that demonstrate sustained performance, not just peak accuracy on a static hold‑out set, reducing surprise failures in production.
- Continuous Monitoring as a Service – The Impermanent framework can be adapted to other streaming domains (e.g., IoT sensor data, financial tick data), offering a plug‑and‑play service for evaluating any forecasting pipeline in real time.
- Guidance for Foundation Model Vendors – The benchmark highlights the need for training procedures that explicitly account for temporal distribution shift, encouraging the development of pre‑training objectives and fine‑tuning strategies that improve temporal robustness.
- Developer Tooling – The open‑source dashboard provides instant visual feedback on forecast quality, enabling rapid debugging and iterative improvement of forecasting codebases.
Limitations & Future Work
- Domain Specificity – The current dataset focuses on GitHub activity, which, while highly dynamic, may not capture all types of temporal non‑stationarity (e.g., seasonality in energy demand).
- Metric Scope – The benchmark emphasizes point‑forecast errors; extending to probabilistic forecasts and calibration metrics would give a fuller picture of uncertainty handling.
- Scalability to Massive Streams – While the system handles hundreds of series, scaling to tens of thousands (e.g., all public repositories) will require more efficient data pipelines and distributed evaluation.
- Model Diversity – Future iterations aim to incorporate multimodal and multivariate models that jointly forecast several metrics, reflecting real‑world scenarios where signals interact.
Impermanent pushes the community toward a more realistic, “always‑on” evaluation mindset—one that aligns better with the challenges developers face when deploying time‑series models in production environments.
Authors
- Azul Garza
- Renée Rosillo
- Rodrigo Mendoza‑Smith
- David Salinas
- Andrew Robert Williams
- Arjun Ashok
- Mononito Goswami
- José Martín Juárez
Paper Information
- arXiv ID: 2603.08707v1
- Categories: cs.LG
- Published: March 9, 2026