[Paper] Temporal Taskification in Streaming Continual Learning: A Source of Evaluation Instability

Published: April 23, 2026 at 01:59 PM EDT

Source: arXiv - 2604.21930v1

Overview

Streaming Continual Learning (CL) aims to train models on never‑ending data streams without forgetting past knowledge. This paper reveals that the way we slice a continuous stream into “tasks” – a step the authors call temporal taskification – is not a harmless preprocessing detail. Different, equally valid task boundaries can lead to dramatically different learning regimes and benchmark outcomes, even when the underlying data, model, and training budget stay the same.

Key Contributions

  • Taskification‑level framework: Introduces plasticity and stability profiles to characterize how a given temporal split behaves before any CL algorithm is applied.
  • Profile distance metric: Quantifies how far apart two taskifications are in terms of their induced learning dynamics.
  • Boundary‑Profile Sensitivity (BPS): A diagnostic that measures how small shifts in task boundaries affect the underlying regime.
  • Empirical study on real network traffic: Evaluates four popular CL strategies (continual fine‑tuning, Experience Replay, Elastic Weight Consolidation, Learning without Forgetting) on the CESNET‑Timeseries24 dataset across multiple temporal granularities (9‑, 30‑, 44‑day splits).
  • Evidence of evaluation instability: Shows that taskification alone can swing forecasting error, forgetting rates, and backward transfer by large margins.
  • Insight on task length: Shorter tasks produce noisier distribution patterns, larger profile distances, and higher BPS, indicating they are more fragile to boundary perturbations.
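The profile and distance notions above can be illustrated with a minimal sketch. The paper's exact formulas are not reproduced here; this uses a simple 1-D histogram drift measure, and all function names are hypothetical:

```python
# Illustrative sketch only: a "plasticity profile" as the drift between
# consecutive tasks, and a "profile distance" between two taskifications.

def histogram(xs, bins=10, lo=0.0, hi=1.0):
    """Normalized histogram of values in [lo, hi)."""
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
        counts[i] += 1
    total = len(xs) or 1
    return [c / total for c in counts]

def shift(p, q):
    """Total-variation distance between two histograms (a drift proxy)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def plasticity_profile(tasks):
    """Drift between each pair of consecutive tasks; higher = more change."""
    hists = [histogram(t) for t in tasks]
    return [shift(hists[i], hists[i + 1]) for i in range(len(hists) - 1)]

def profile_distance(profile_a, profile_b):
    """Mean absolute gap between two profiles, truncated to common length."""
    n = min(len(profile_a), len(profile_b))
    return sum(abs(profile_a[i] - profile_b[i]) for i in range(n)) / n
```

A stability profile would be the complementary quantity (how much distributions overlap across tasks); the key point is that both are computed from the data split alone, before any CL algorithm runs.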

Methodology

  1. Define the stream – The authors fix a single, long‑term network‑traffic time series (CESNET‑Timeseries24).
  2. Generate multiple taskifications – They partition the same stream into non‑overlapping windows of 9, 30, and 44 days, then create perturbed versions by shifting the window boundaries by a few hours/days.
  3. Compute plasticity & stability profiles – For each taskification, they measure how much the data distribution changes across consecutive tasks (plasticity) and how much it stays the same (stability) without training a model.
  4. Calculate profile distance & BPS – The distance between two taskifications’ profiles quantifies structural differences; BPS aggregates these distances to capture sensitivity to boundary shifts.
  5. Run CL algorithms – Using a fixed neural architecture and training budget, they train the four CL methods on each taskification and record standard metrics: forecasting error, forgetting, and backward transfer.
  6. Analyze variance – By comparing results across taskifications, they isolate the effect of temporal partitioning from model or data changes.
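Steps 2 and 4 can be sketched as follows. The paper's precise BPS definition is not given here; this sketch aggregates profile distances over boundary shifts with a simple mean, and all names are hypothetical:

```python
# Illustrative sketch: partition a stream into fixed-length tasks, shift
# the boundaries, and aggregate profile distances into a BPS-like score.

def taskify(stream, task_len, offset=0):
    """Partition a stream into non-overlapping windows of task_len samples,
    starting at `offset` (a boundary shift)."""
    return [stream[i:i + task_len]
            for i in range(offset, len(stream) - task_len + 1, task_len)]

def boundary_profile_sensitivity(stream, task_len, shifts, profile_fn, dist_fn):
    """Mean profile distance between the reference taskification and each
    boundary-shifted variant; higher = more fragile to boundary choice."""
    ref = profile_fn(taskify(stream, task_len))
    dists = [dist_fn(ref, profile_fn(taskify(stream, task_len, s)))
             for s in shifts]
    return sum(dists) / len(dists)
```

In the study's terms, `task_len` would correspond to the 9-, 30-, or 44-day window and `shifts` to the few-hour/few-day boundary perturbations.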

Results & Findings

| Task length | Forecasting error (Δ) | Forgetting (Δ) | Backward transfer (Δ) |
| --- | --- | --- | --- |
| 9‑day splits | up to +12% vs. 44‑day | up to +18% | swings from +5% to −7% |
| 30‑day splits | moderate variations (≈ ±5%) | ±9% | mixed signs |
| 44‑day splits | most stable, but still ±3% | ±4% | small changes |
  • Profile distance grows as task length shrinks, confirming that shorter windows create more divergent learning regimes.
  • BPS is highest for 9‑day taskifications (≈ 0.42) and lowest for 44‑day ones (≈ 0.15), indicating that tiny boundary tweaks can drastically reshape the regime for fine‑grained splits.
  • All four CL methods exhibit the same pattern: performance swings are driven primarily by the taskification, not by the algorithm itself.
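The "swing" reported above can be computed as a relative spread of a metric across taskifications. A minimal sketch (the name and formula are illustrative assumptions, not taken from the paper):

```python
# Illustrative: relative spread of a metric across taskifications of the
# same stream; a large swing means the split, not the model, drives results.

def metric_swing(values):
    """(max - min) / mean of a metric measured under each taskification."""
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean
```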

Practical Implications

  • Benchmark design: When publishing CL results, researchers (and engineers evaluating CL solutions) must report how the stream was taskified. A single benchmark split is insufficient to claim robustness.
  • Model selection for production: In real‑world streaming systems (e.g., network traffic prediction, IoT sensor analytics), the natural “task” boundaries may be ambiguous. Engineers should test CL models across multiple plausible temporal partitions to avoid over‑optimistic performance estimates.
  • Tooling: The paper’s profile‑distance and BPS metrics can be incorporated into CI pipelines for continual‑learning projects, automatically flagging when a new data ingestion schedule could invalidate previously measured performance.
  • Algorithm development: Knowing that short‑window taskifications are highly sensitive suggests a research direction: design CL methods that explicitly account for boundary uncertainty (e.g., adaptive replay buffers that weigh recent vs. older data based on detected distribution shifts).
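The tooling idea above can be sketched as a CI gate. The threshold and the pass/fail shape are illustrative assumptions, not values or interfaces from the paper:

```python
# Illustrative CI-style guard: flag an ingestion-schedule change whose
# BPS-like score crosses a (hypothetical) tolerance threshold.

def check_taskification_stability(bps_score, threshold=0.3):
    """Return a (passed, message) pair for a CI gate on boundary sensitivity."""
    if bps_score > threshold:
        return False, (f"BPS {bps_score:.2f} exceeds {threshold:.2f}: "
                       "re-validate CL metrics under the new schedule")
    return True, f"BPS {bps_score:.2f} within tolerance"
```

With the paper's reported values, a 44-day split (BPS ≈ 0.15) would pass such a gate while a 9-day split (BPS ≈ 0.42) would be flagged for re-validation.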

Limitations & Future Work

  • Single domain: Experiments focus on network‑traffic time series; results may differ for vision, NLP, or multimodal streams.
  • Fixed model & budget: The study holds architecture and compute constant; varying model capacity could interact with taskification effects.
  • Boundary perturbations limited: Only small shifts were examined; larger, irregular splits (e.g., event‑driven boundaries) remain unexplored.
  • Future directions: Extending the framework to multi‑modal streams, integrating taskification‑aware loss functions, and creating standardized “taskification suites” for CL benchmarking.

Authors

  • Nicolae Filat
  • Ahmed Hussain
  • Konstantinos Kalogiannis
  • Elena Burceanu

Paper Information

  • arXiv ID: 2604.21930v1
  • Categories: cs.LG
  • Published: April 23, 2026
