[Paper] Reverso: Efficient Time Series Foundation Models for Zero-shot Forecasting

Published: February 19, 2026 at 01:48 PM EST
4 min read

Source: arXiv - 2602.17634v1

Overview

The paper “Reverso: Efficient Time Series Foundation Models for Zero‑shot Forecasting” shows that you don’t need massive, transformer‑based models to get strong zero‑shot forecasting performance. By swapping heavyweight attention layers for a lightweight hybrid of long‑range convolutions and linear RNNs (DeltaNet), the authors build foundation models that are 100× smaller while staying competitive—or even superior—on a wide range of time‑series tasks.

Key Contributions

  • Compact architecture: Introduces a hybrid model that interleaves long‑horizon convolutions with DeltaNet linear RNN layers, eliminating the need for large‑scale transformers.
  • Performance‑efficiency breakthrough: Demonstrates that models with < 1 M parameters can match or exceed the accuracy of transformer models with hundreds of millions of parameters on zero‑shot forecasting benchmarks.
  • Data‑centric tricks: Proposes simple yet effective augmentation (e.g., random scaling, jitter, masking) and inference tricks (e.g., test‑time ensembling, sliding‑window voting) that boost accuracy without extra training cost.
  • Reverso family: Releases a suite of pretrained models (Reverso‑S, Reverso‑M, Reverso‑L) covering different size‑accuracy trade‑offs, all publicly available.
  • Pareto‑frontier analysis: Provides a systematic comparison of model size, FLOPs, and forecasting error, establishing a new state‑of‑the‑art point on the performance‑efficiency curve for time‑series foundation models.

Methodology

  1. Hybrid backbone – The model stacks two building blocks (a minimal sketch follows this list):

    • Long convolution layers (kernel sizes up to 512) capture distant temporal patterns efficiently via FFT‑based convolution.
    • DeltaNet linear RNN layers propagate information linearly, offering a cheap alternative to self‑attention while preserving sequence order.

    The alternating pattern lets the network learn both global trends and fine‑grained dynamics without quadratic attention costs.
  2. Pretraining regime – A single large, heterogeneous corpus of more than 10,000 time series (finance, electricity, traffic, weather, etc.) is used. The objective is a masked‑reconstruction loss in the spirit of BERT: randomly hide contiguous windows and ask the model to predict them, encouraging robust representations that generalize across domains (the second sketch below pairs this objective with the augmentations).

  3. Data augmentation – During pretraining, each series is randomly transformed (amplitude scaling, time warping, additive noise, and segment dropout). This forces the model to learn invariances that are crucial for zero‑shot transfer.

  4. Inference tricks – At test time, the authors apply two tricks (combined into one wrapper in the final sketch below):

    • Sliding‑window ensembling – multiple overlapping forecasts are averaged, reducing variance.
    • Multi‑scale prompting – the same series is fed at different down‑sampling rates, and predictions are combined.
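
To make the hybrid backbone concrete, here is a minimal PyTorch sketch of the two building blocks and their alternation. The layer sizes, the sigmoid write gate, the key normalization, and the naive sequential scan (practical DeltaNet implementations use a chunked parallel scan) are assumptions of this sketch, not details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongConv(nn.Module):
    """Depthwise long convolution computed via FFT in O(L log L)."""
    def __init__(self, d_model: int, kernel_size: int = 512):
        super().__init__()
        self.kernel = nn.Parameter(0.02 * torch.randn(d_model, kernel_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        L = x.shape[1]
        n = L + self.kernel.shape[-1]                 # pad so the conv is linear, not circular
        x_f = torch.fft.rfft(x.transpose(1, 2), n=n)  # (B, D, n//2 + 1)
        k_f = torch.fft.rfft(self.kernel, n=n)        # (D, n//2 + 1)
        y = torch.fft.irfft(x_f * k_f, n=n)[..., :L]  # first L samples = causal output
        return y.transpose(1, 2)

class DeltaNetLayer(nn.Module):
    """Simplified DeltaNet-style linear RNN: a delta-rule update of a
    matrix-valued state, run here as a plain sequential scan."""
    def __init__(self, d_model: int, d_key: int = 32):
        super().__init__()
        self.d_key = d_key
        self.q = nn.Linear(d_model, d_key)
        self.k = nn.Linear(d_model, d_key)
        self.v = nn.Linear(d_model, d_model)
        self.beta = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, D = x.shape
        q, k, v = self.q(x), F.normalize(self.k(x), dim=-1), self.v(x)
        beta = torch.sigmoid(self.beta(x))            # write strength in (0, 1)
        S = x.new_zeros(B, D, self.d_key)             # recurrent state
        outs = []
        for t in range(L):
            kt = k[:, t]                                        # (B, d_key)
            old = torch.einsum('bdk,bk->bd', S, kt)             # value stored under kt
            S = S + torch.einsum('bd,bk->bdk',
                                 beta[:, t] * (v[:, t] - old),  # delta rule: swap old for new
                                 kt)
            outs.append(torch.einsum('bdk,bk->bd', S, q[:, t]))
        return torch.stack(outs, dim=1)               # (B, L, D)

class HybridBlock(nn.Module):
    """One block of the alternating pattern: long conv, then linear RNN."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv, self.rnn = LongConv(d_model), DeltaNetLayer(d_model)
        self.n1, self.n2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + self.conv(self.n1(x))  # global trends
        x = x + self.rnn(self.n2(x))   # order-sensitive local dynamics
        return x
```

The FFT step is what keeps 512‑tap kernels affordable: convolution in time becomes elementwise multiplication in frequency, so the cost grows as O(L log L) rather than O(L·K).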

All components are deliberately simple and can be reproduced with standard deep‑learning libraries (PyTorch, TensorFlow).
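
A sketch of how the masked‑reconstruction objective and the augmentations might look in practice. The 32‑step mask window, the transform probabilities, and the noise magnitudes are placeholders (the summary does not give the paper's exact settings), and time warping is omitted for brevity.

```python
import torch

def augment(x: torch.Tensor) -> torch.Tensor:
    """Random amplitude scaling, jitter, and segment dropout on a (length,) series.
    Probabilities and magnitudes are illustrative placeholders."""
    x = x.clone()
    if torch.rand(1) < 0.5:                               # amplitude scaling
        x = x * (0.5 + torch.rand(1))
    if torch.rand(1) < 0.5:                               # jitter (additive noise)
        x = x + 0.02 * x.std() * torch.randn_like(x)
    if torch.rand(1) < 0.3:                               # segment dropout
        s = torch.randint(0, max(1, x.numel() - 16), (1,)).item()
        x[s:s + 16] = 0.0
    return x

def mask_windows(batch: torch.Tensor, window: int = 32):
    """BERT-style target construction: hide one contiguous window per series
    and return (masked input, boolean mask of hidden positions)."""
    masked, mask = batch.clone(), torch.zeros_like(batch, dtype=torch.bool)
    for i in range(batch.shape[0]):
        s = torch.randint(0, max(1, batch.shape[1] - window), (1,)).item()
        masked[i, s:s + window] = 0.0
        mask[i, s:s + window] = True
    return masked, mask
```

With these pieces, a training step reduces to something like `loss = F.mse_loss(model(masked)[mask], batch[mask])`: the model is penalized only where values were hidden.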
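
The two inference tricks compose naturally into a single averaging wrapper. In this sketch, `model(context, horizon)` is a stand‑in for whatever forecasting call the released checkpoints actually expose, and the context offsets and down‑sampling rates are arbitrary illustrative choices.

```python
import torch

@torch.no_grad()
def ensembled_forecast(model, series: torch.Tensor, horizon: int,
                       context: int = 512, extra: tuple = (0, 32, 64),
                       scales: tuple = (2,)) -> torch.Tensor:
    """Average overlapping-context forecasts (sliding-window ensembling)
    with forecasts from down-sampled views (multi-scale prompting)."""
    preds = []
    for e in extra:                                  # same endpoint, varying context length
        preds.append(model(series[-(context + e):], horizon))
    for s in scales:                                 # coarser view of the same series
        coarse = model(series[::s][-context:], -(-horizon // s))  # ceil(horizon / s) steps
        preds.append(coarse.repeat_interleave(s)[:horizon])       # back to the native rate
    return torch.stack(preds).mean(dim=0)            # variance-reducing average
```

A persistence stub such as `model = lambda ctx, h: ctx[-1:].repeat(h)` is enough to sanity‑check the plumbing before swapping in a real checkpoint.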

Results & Findings

| Model | Params | FLOPs (per forecast) | MSE ↓ (average) | Relative speed ↑ |
|---|---|---|---|---|
| Large Transformer (baseline) | 300 M | 1.2 G | 0.92 | 1× (reference) |
| Reverso‑S (small) | 0.8 M | 4 M | 0.94 | ≈ 300× |
| Reverso‑M (medium) | 3 M | 12 M | 0.91 | ≈ 120× |
| Reverso‑L (large) | 12 M | 45 M | 0.89 | ≈ 30× |

  • Accuracy: Even the tiniest Reverso‑S matches the baseline transformer within 2 % relative error, while Reverso‑L actually outperforms it by ~3 %.
  • Efficiency: Inference latency drops from several seconds to tens of milliseconds on a single CPU core, making real‑time deployment feasible.
  • Zero‑shot transfer: When evaluated on unseen domains (e.g., cryptocurrency prices, solar irradiance), the Reverso models retain their edge, confirming that the learned representations are truly domain‑agnostic.

Practical Implications

  • Edge & IoT deployment – The sub‑megabyte footprint means you can run a capable forecaster on micro‑controllers, routers, or mobile devices without cloud calls.
  • Cost‑effective SaaS – Cloud providers can serve thousands of forecasting requests per GPU, slashing compute bills for analytics platforms.
  • Rapid prototyping – Developers can plug the pretrained Reverso model into existing pipelines (e.g., alongside Prophet or ARIMA wrappers) and obtain strong baselines without any task‑specific fine‑tuning; a hypothetical adapter sketch follows this list.
  • Unified forecasting service – Companies with heterogeneous time‑series (sales, sensor logs, user activity) can adopt a single model instead of maintaining a zoo of specialized algorithms.
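
To make the rapid‑prototyping point concrete: this summary does not document the released checkpoints' API, so the adapter below is purely hypothetical. It wraps any `(history, horizon) -> forecast` callable behind the fit/predict interface that common ARIMA and Prophet wrappers expose.

```python
import numpy as np

class ZeroShotForecaster:
    """Hypothetical sklearn-style adapter for a pretrained zero-shot model.
    `fit` only stores history: no task-specific training happens."""
    def __init__(self, model):
        self.model = model                     # callable: (history, horizon) -> forecast

    def fit(self, y) -> "ZeroShotForecaster":
        self.history_ = np.asarray(y, dtype=np.float32)
        return self

    def predict(self, horizon: int) -> np.ndarray:
        return np.asarray(self.model(self.history_, horizon))
```

Because `fit` does nothing beyond storing the series, swapping this in next to a classical baseline changes a single constructor line, which is exactly the rapid‑prototyping appeal.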

Limitations & Future Work

  • Long‑horizon degradation – Forecasts beyond 200 steps start to lose fidelity; the authors suggest integrating hierarchical decoding or external memory.
  • Limited interpretability – While the architecture is simpler than transformers, the linear RNN dynamics are still opaque; future work could add attention‑style attribution layers.
  • Domain‑specific fine‑tuning – The paper focuses on zero‑shot performance; exploring lightweight fine‑tuning (e.g., LoRA adapters) could push accuracy further for high‑stakes domains like finance.
  • Benchmark breadth – Experiments cover 12 public datasets; expanding to ultra‑high‑frequency data (nanosecond tick data) would test the limits of the convolution‑RNN combo.

Overall, Reverso demonstrates that efficiency need not sacrifice accuracy in time‑series foundation models, opening the door for scalable, real‑world forecasting solutions.

Authors

  • Xinghong Fu
  • Yanhong Li
  • Georgios Papaioannou
  • Yoon Kim

Paper Information

  • arXiv ID: 2602.17634v1
  • Categories: cs.LG, cs.AI
  • Published: February 19, 2026