[Paper] BSAT: B-Spline Adaptive Tokenizer for Long-Term Time Series Forecasting

Published: January 2, 2026 at 09:27 AM EST
Source: arXiv - 2601.00698v1

Overview

The paper introduces BSAT (B‑Spline Adaptive Tokenizer), a new way to preprocess long‑term time‑series data for transformer models. By fitting B‑splines to the raw series, BSAT creates tokens that automatically focus on the most “interesting” (high‑curvature) parts of the signal, dramatically cutting down the number of tokens the model has to handle while preserving forecasting accuracy.

Key Contributions

  • Adaptive tokenization via B‑splines – a parameter‑free algorithm that places tokens where the series changes rapidly and merges smooth regions into single tokens.
  • Fixed‑size token representation – each variable‑length spline segment is encoded as a compact token containing its spline coefficients and positional metadata.
  • Hybrid positional encoding (L‑RoPE) – combines a learnable additive encoding with a rotary embedding whose base can be tuned per transformer layer, enabling each layer to capture different temporal scales.
  • High compression with competitive accuracy – extensive experiments on standard long‑term forecasting benchmarks show BSAT matches or exceeds state‑of‑the‑art models while using far fewer tokens.
  • Memory‑efficient design – the method is especially attractive for edge devices or cloud services where GPU memory is a bottleneck.

Methodology

  1. Fit B‑splines to the raw series

    • The algorithm fits a piecewise‑polynomial B‑spline to each univariate channel of the time series.
    • Knot points (where the spline pieces join) are automatically placed at locations of high curvature, i.e., where the second derivative exceeds a threshold.
  2. Token creation

    • Each spline segment becomes a token.
    • The token stores:
      • the spline coefficient(s) that define the segment’s shape,
      • the start time (or normalized position) of the segment, and
      • a fixed dimensionality (e.g., 64‑dim) obtained by projecting the coefficient vector through a small linear layer (a toy sketch of steps 1–2 follows this list).
  3. Hybrid positional encoding (L‑RoPE)

    • Additive learnable PE: a standard trainable vector added to each token.
    • Rotary PE: rotates token embeddings based on their timestamps; the rotation base is a learnable scalar that can differ per transformer layer, allowing deeper layers to attend to longer horizons (a sketch of this hybrid encoding also follows the list).
  4. Transformer backbone

    • The token sequence (now much shorter) is fed to a standard encoder‑decoder or encoder‑only transformer.
    • Because the token count is reduced, self‑attention's quadratic cost shrinks dramatically even for very long horizons: at the 1/8–1/12 compression reported below, the number of pairwise attention scores drops by roughly 64–144×.
  5. Training & inference

    • The whole pipeline (tokenization + transformer) is end‑to‑end differentiable: the B‑spline fitting is deterministic and parameter‑free, so training only updates the learnable components (the token projection, the positional encodings, and the transformer weights).
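
To make the tokenization concrete, here is a minimal NumPy/SciPy sketch of steps 1–2. It is my own illustration, not the paper's code: the quantile threshold, the minimum knot spacing, and the function names (adaptive_knots, fit_segments, segments_to_tokens) are assumptions, and a random matrix stands in for the learned linear projection.

```python
# Illustrative sketch only: curvature-driven knot placement and segment tokens,
# reconstructed from the summary above (not the authors' implementation).
import numpy as np
from scipy.interpolate import make_lsq_spline

def adaptive_knots(y, q=0.90, min_gap=8):
    """Candidate interior knots where |second derivative| is in the top (1-q) quantile."""
    curv = np.abs(np.gradient(np.gradient(y)))          # discrete curvature proxy
    cands = np.flatnonzero(curv > np.quantile(curv, q))
    knots, last = [], -min_gap
    for i in cands:                                      # greedily enforce a minimum spacing
        if i - last >= min_gap:
            knots.append(i)
            last = i
    return np.asarray(knots, dtype=float)

def fit_segments(y, k=3, q=0.90, min_gap=8):
    """Fit a clamped cubic LSQ B-spline whose interior knots sit at high-curvature points."""
    x = np.arange(len(y), dtype=float)
    interior = adaptive_knots(y, q, min_gap)
    interior = interior[(interior > x[0]) & (interior < x[-1])]
    t = np.r_[[x[0]] * (k + 1), interior, [x[-1]] * (k + 1)]   # clamped knot vector
    return make_lsq_spline(x, y, t, k=k), interior

def segments_to_tokens(spline, interior, n, d_model=64, k=3, seed=0):
    """One token per knot span: local spline coefficients + normalized start, linearly projected."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((k + 2, d_model)) / np.sqrt(k + 2)  # stand-in for the learned projection
    bounds = np.r_[0.0, interior, float(n - 1)]
    tokens = []
    for s, start in enumerate(bounds[:-1]):
        local = spline.c[s : s + k + 1]                  # the k+1 coefficients active on span s
        feat = np.r_[local, start / (n - 1)]             # segment shape + normalized start position
        tokens.append(feat @ W)
    return np.stack(tokens)                              # (num_segments, d_model)

# Toy series: a sinusoid with a level shift, so curvature concentrates around the jump.
y = np.sin(np.linspace(0, 12, 512)) + 0.3 * (np.linspace(0, 12, 512) > 8)
spline, interior = fit_segments(y)
tokens = segments_to_tokens(spline, interior, len(y))
print(tokens.shape)                                      # far fewer rows than the 512 input points
```

On this toy series the knots cluster around the level shift and the sharpest turns of the sinusoid, which is exactly the "tokens where the series changes rapidly" behaviour described above.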
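
Similarly, here is a small PyTorch sketch of the hybrid positional encoding in step 3: a learnable additive embedding plus a rotary rotation driven by segment timestamps, with the rotary base stored as a learnable scalar so each layer can own a differently tuned instance. The class name LRoPE, the log‑parameterization of the base, and applying the rotation directly to the token embeddings (as the summary phrases it, rather than to queries/keys inside attention) are simplifications of mine.

```python
# Illustrative sketch of a hybrid learnable + rotary positional encoding
# (my reconstruction from the summary, not the paper's code).
import torch
import torch.nn as nn

class LRoPE(nn.Module):
    """Learnable additive embedding + rotary rotation with a learnable base."""
    def __init__(self, d_model: int, max_tokens: int, init_base: float = 10000.0):
        super().__init__()
        assert d_model % 2 == 0
        self.additive = nn.Parameter(torch.zeros(max_tokens, d_model))   # learnable additive PE
        self.log_base = nn.Parameter(torch.tensor(init_base).log())      # learnable rotary base
        self.half = d_model // 2

    def forward(self, x: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_tokens, d_model); timestamps: (batch, n_tokens) segment start times
        x = x + self.additive[: x.size(1)]
        idx = torch.arange(self.half, device=x.device, dtype=x.dtype)
        freqs = torch.exp(-idx / self.half * self.log_base)              # base ** (-2i / d_model)
        angles = timestamps.unsqueeze(-1) * freqs                        # (batch, n_tokens, half)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., : self.half], x[..., self.half :]
        return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# One LRoPE instance per transformer layer lets each layer learn its own rotary base,
# i.e. a different characteristic time scale per depth.
pe = LRoPE(d_model=64, max_tokens=256)
tokens = torch.randn(2, 40, 64)                      # e.g. 40 adaptive tokens per series
starts = torch.cumsum(torch.rand(2, 40), dim=-1)     # irregular segment start times
print(pe(tokens, starts).shape)                      # torch.Size([2, 40, 64])
```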

Results & Findings

| Dataset (benchmark) | Horizon | Tokens / Input Length | MAE ↓ / MSE ↓ (vs. baseline) |
| --- | --- | --- | --- |
| ETTh1 | 96 | 1/8 of original | +3.2 % MAE, +2.8 % MSE |
| Traffic | 336 | 1/10 of original | +2.5 % MAE, +2.1 % MSE |
| Electricity | 168 | 1/12 of original | +1.9 % MAE, +1.7 % MSE |

  • Compression vs. accuracy trade‑off: Even at 90 % compression (i.e., only 10 % of the original tokens), BSAT’s error increase stays under 5 % on most benchmarks.
  • Memory footprint: GPU memory usage drops by up to 80 % compared with vanilla transformer baselines.
  • Ablation: Removing L‑RoPE or using uniform (non‑adaptive) tokenization degrades performance by 4–7 %, confirming the importance of both components.

Overall, BSAT delivers state‑of‑the‑art long‑term forecasting while keeping the model lightweight enough for constrained environments.

Practical Implications

  • Edge & IoT deployments – Sensors generating high‑frequency data (e.g., smart grids, industrial IoT) can run BSAT‑based forecasters on devices with limited RAM, extending battery life and reducing cloud bandwidth.
  • Cost‑effective cloud services – Lower memory usage translates to cheaper GPU instances for SaaS forecasting platforms, enabling higher request throughput.
  • Dynamic resolution – Because tokens concentrate on high‑activity periods, developers can obtain finer‑grained predictions where they matter most (e.g., spikes in traffic or demand).
  • Plug‑and‑play – BSAT is a preprocessing layer; existing transformer codebases can adopt it with minimal changes, making integration straightforward (see the sketch after this list).
  • Explainability – The spline knots provide a natural way to visualize which parts of the series the model deems important, aiding debugging and stakeholder communication.
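
To illustrate that plug‑and‑play point, here is a minimal sketch under stated assumptions: once the (hypothetical) BSAT tokens are a (batch, n_tokens, d_model) tensor, they can feed an off‑the‑shelf nn.TransformerEncoder with a small forecasting head. This is not the paper's released code, and the random tensor below merely stands in for real BSAT tokens.

```python
# Illustrative integration sketch: adaptive tokens in, standard transformer out.
import torch
import torch.nn as nn

d_model, horizon = 64, 96
layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=3)   # unmodified transformer backbone
head = nn.Linear(d_model, horizon)                     # map the last token to a 96-step forecast

tokens = torch.randn(2, 40, d_model)                   # stand-in for BSAT tokens (batch, n_tokens, d_model)
forecast = head(encoder(tokens)[:, -1])                # (batch, horizon)
print(forecast.shape)
```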

Limitations & Future Work

  • Assumes smoothness – B‑splines work best when the underlying signal is piecewise smooth; highly noisy or chaotic series may require additional denoising steps.
  • Univariate tokenization – The current implementation tokenizes each variable independently; extending to multivariate spline fitting could capture cross‑channel dynamics more efficiently.
  • Fixed spline order – The paper uses a fixed cubic spline; exploring adaptive orders or other basis functions (e.g., wavelets) might improve representation for certain domains.
  • Scalability of knot detection – While the algorithm is linear in series length, extremely long streams (billions of points) could still benefit from a streaming or hierarchical knot‑selection scheme.

Future research directions include multivariate adaptive tokenizers, online spline fitting for real‑time streams, and combining BSAT with sparse‑attention transformers to push the limits of ultra‑long horizon forecasting.

Authors

  • Maximilian Reinwardt
  • Michael Eichelbeck
  • Matthias Althoff

Paper Information

  • arXiv ID: 2601.00698v1
  • Categories: cs.LG
  • Published: January 2, 2026