[Paper] LibContinual: A Comprehensive Library towards Realistic Continual Learning

Published: December 26, 2025 at 08:59 AM EST
4 min read

Source: arXiv - 2512.22029v1

Overview

Continual Learning (CL) promises AI systems that can keep learning new tasks without erasing what they already know, but in practice the field suffers from fragmented codebases and inconsistent evaluation protocols. The new LibContinual library tackles this mess by offering a single, well‑engineered platform that bundles 19 state‑of‑the‑art CL algorithms, standardizes the experimental pipeline, and forces researchers to test under realistic constraints such as online data streams, limited memory, and heterogeneous task semantics.

Key Contributions

  • Unified, production‑ready library: 19 CL algorithms spanning five methodological families (regularization, replay, parameter isolation, architecture growth, and hybrid approaches) are implemented with a common API and dependency set.
  • Modular architecture: A high-cohesion / low-coupling design makes it easy to plug in new methods, datasets, or evaluation metrics without breaking existing code.
  • Critical audit of hidden assumptions: The authors expose three “implicit” assumptions that most papers make—offline data access, unlimited replay memory, and intra‑task semantic homogeneity—and show how they inflate reported performance.
  • Realistic evaluation protocols: Introduces (1) strict online learning (data arrives once, no revisits), (2) a unified memory‑budget protocol that caps total replay storage across the whole life‑time, and (3) a category‑randomized benchmark that mixes semantically unrelated tasks.
  • Open‑source and reproducible: Full code, documentation, and pre‑configured Docker images are released, lowering the barrier for both academic and industry teams to adopt realistic CL testing.

Methodology

LibContinual is built around a pipeline abstraction that separates four core components (a minimal code sketch follows the list):

  1. Data Loader – streams data in a single pass (online mode) or batch mode for baseline comparison.
  2. Model Wrapper – encapsulates any PyTorch model, exposing hooks for regularization terms, parameter masks, or replay buffers.
  3. Trainer – orchestrates the learning loop, handling task boundaries, memory updates, and metric logging.
  4. Evaluator – computes continual‑learning metrics (average accuracy, forgetting, forward/backward transfer) under the chosen budget constraints.
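
To make the pipeline abstraction concrete, the sketch below shows how the four components could fit together in PyTorch. The class and method names here (StreamLoader, ModelWrapper, Trainer, Evaluator, fit_task, and so on) are illustrative assumptions for this summary, not LibContinual's actual API.

```python
# Hypothetical sketch of a four-component CL pipeline (not LibContinual's real API).
from typing import Iterable, Iterator, List, Tuple

import torch
from torch import nn


class StreamLoader:
    """Yields each (x, y) batch exactly once, emulating a strict online stream."""

    def __init__(self, task_batches: List[Tuple[torch.Tensor, torch.Tensor]]):
        self.task_batches = task_batches

    def __iter__(self) -> Iterator[Tuple[torch.Tensor, torch.Tensor]]:
        yield from self.task_batches  # single pass: no shuffling, no revisiting


class ModelWrapper(nn.Module):
    """Wraps any backbone and exposes a hook for method-specific penalty terms."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(x)

    def regularization(self) -> torch.Tensor:
        return torch.zeros(())  # e.g. an EWC/LwF-style penalty would plug in here


class Trainer:
    """Runs one online pass over a task, adding the wrapper's penalty to the loss."""

    def __init__(self, model: ModelWrapper, lr: float = 0.01):
        self.model = model
        self.opt = torch.optim.SGD(model.parameters(), lr=lr)

    def fit_task(self, stream: Iterable[Tuple[torch.Tensor, torch.Tensor]]) -> None:
        self.model.train()
        for x, y in stream:
            loss = nn.functional.cross_entropy(self.model(x), y)
            loss = loss + self.model.regularization()
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()


class Evaluator:
    """Computes per-task accuracy after each task, feeding ACC/forgetting curves."""

    @torch.no_grad()
    def accuracy(self, model: ModelWrapper,
                 batches: Iterable[Tuple[torch.Tensor, torch.Tensor]]) -> float:
        model.eval()
        correct, total = 0, 0
        for x, y in batches:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
        return correct / max(total, 1)
```

In this shape, a new CL method would mainly subclass ModelWrapper (to add its penalty, mask, or buffer logic) without touching the Trainer or Evaluator, which mirrors the high-cohesion / low-coupling goal described above.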

The authors then run three systematic experiments (illustrative sketches of the memory-budget and category-randomized setups appear below):

  • Offline vs. Online: Compare each algorithm when data can be revisited (traditional setting) versus a strict one‑pass stream.
  • Unlimited vs. Fixed Replay Memory: Enforce a global memory cap (e.g., 200 MiB) that all replay‑based methods must share, rather than allowing each method its own unlimited buffer.
  • Semantic Homogeneity vs. Randomized Categories: Shuffle task labels across unrelated categories (e.g., mixing animal, vehicle, and medical image classes) to test robustness to semantic drift.
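
The second experiment's unified budget can be pictured as one byte-capped buffer that every replay-based method must share. Below is a minimal reservoir-sampling sketch under the 200 MiB example cap; the class name, fields, and sampling details are my own illustration, not the protocol's actual implementation.

```python
# Illustrative byte-capped replay buffer with reservoir-style sampling
# (a sketch of a unified memory-budget protocol, not LibContinual code).
import random
from typing import List, Tuple

import torch


class BudgetedReplayBuffer:
    def __init__(self, budget_bytes: int = 200 * 1024 * 1024):  # e.g. a 200 MiB cap
        self.budget_bytes = budget_bytes
        self.samples: List[Tuple[torch.Tensor, int]] = []
        self.used_bytes = 0
        self.seen = 0  # total examples offered so far, for reservoir sampling

    @staticmethod
    def _nbytes(x: torch.Tensor) -> int:
        return x.element_size() * x.nelement()

    def offer(self, x: torch.Tensor, y: int) -> None:
        """Consider one example for storage without ever exceeding the byte budget."""
        self.seen += 1
        cost = self._nbytes(x)
        if self.used_bytes + cost <= self.budget_bytes:
            self.samples.append((x.clone(), int(y)))
            self.used_bytes += cost
            return
        # Buffer full: replace an existing slot with probability |buffer| / seen.
        j = random.randrange(self.seen)
        if j < len(self.samples):
            old_x, _ = self.samples[j]
            if self.used_bytes - self._nbytes(old_x) + cost <= self.budget_bytes:
                self.used_bytes += cost - self._nbytes(old_x)
                self.samples[j] = (x.clone(), int(y))

    def sample(self, k: int) -> List[Tuple[torch.Tensor, int]]:
        """Draw up to k stored exemplars for rehearsal."""
        return random.sample(self.samples, min(k, len(self.samples)))
```

Sharing a single instance of such a buffer across methods means iCaRL, GEM, and DER++ all compete for the same exemplar storage over the whole lifetime, instead of each assuming its own unlimited memory.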

All experiments are executed with identical hyper‑parameters (learning rate, batch size, optimizer) to ensure a fair apples‑to‑apples comparison.
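
The category-randomized benchmark from the third experiment can likewise be approximated by shuffling class IDs before carving them into tasks, so that each task mixes semantically unrelated categories. Again, this is a hedged illustration rather than the library's actual split-construction code, and the function name is made up for this summary.

```python
# Sketch of a category-randomized task split: class IDs are shuffled before being
# carved into tasks, so each task mixes unrelated categories (illustrative only).
import random
from typing import List


def randomized_task_splits(num_classes: int, num_tasks: int, seed: int = 0) -> List[List[int]]:
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)  # break any semantic grouping encoded in consecutive class IDs
    per_task = num_classes // num_tasks
    return [classes[i * per_task:(i + 1) * per_task] for i in range(num_tasks)]


if __name__ == "__main__":
    # e.g. 100 CIFAR-100 classes split into 10 tasks of 10 arbitrary classes each
    print(randomized_task_splits(num_classes=100, num_tasks=10))
```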

Results & Findings

| Setting | Best-performing family (average accuracy) | Typical drop vs. traditional eval |
| --- | --- | --- |
| Offline (standard) | Replay-based methods (e.g., iCaRL, GEM), ~78% | baseline |
| Online stream | Regularization-based (e.g., EWC, LwF), ~65% | -13 pp on average |
| Fixed memory budget | Hybrid (e.g., DER++), ~62% | -16 pp compared to unlimited memory |
| Category-randomized | Parameter-isolation (e.g., PackNet), ~58% | -20 pp relative to homogeneous tasks |
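
For reference, the quantities behind these numbers are the standard continual-learning metrics named in the Evaluator description above. The definitions below are the ones commonly used in the literature and may differ in notation from the paper; drops are reported in percentage points (pp) of average accuracy relative to the corresponding traditional setting.

```latex
% a_{t,i}: test accuracy on task i after training through task t; T: total number of tasks
\begin{align*}
\mathrm{ACC} &= \frac{1}{T}\sum_{i=1}^{T} a_{T,i}
  && \text{(average accuracy)}\\
\mathrm{Forgetting} &= \frac{1}{T-1}\sum_{i=1}^{T-1}\Bigl(\max_{t \le T-1} a_{t,i} - a_{T,i}\Bigr)
  && \text{(average forgetting)}\\
\mathrm{BWT} &= \frac{1}{T-1}\sum_{i=1}^{T-1}\bigl(a_{T,i} - a_{i,i}\bigr)
  && \text{(backward transfer)}\\
\mathrm{FWT} &= \frac{1}{T-1}\sum_{i=2}^{T}\bigl(a_{i-1,i} - \bar{b}_{i}\bigr)
  && \text{(forward transfer, with } \bar{b}_{i} \text{ the accuracy of a randomly initialized model on task } i\text{)}
\end{align*}
```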

Key takeaways

  • Replay methods crumble when the memory budget is capped; they rely heavily on storing many exemplars.
  • Regularization and isolation strategies are more resilient to online constraints but still suffer noticeable accuracy loss.
  • Hybrid approaches that combine modest replay with architectural tricks (e.g., DER++) strike the best balance under realistic limits.
  • Across all settings, the average forgetting rate spikes dramatically, confirming that many published numbers are optimistic artifacts of hidden assumptions.

Practical Implications

  • Product teams building edge AI (e.g., on‑device assistants, robotics) can now benchmark CL algorithms against the same memory ceiling they will face in the field, avoiding costly “surprise” performance drops after deployment.
  • MLOps pipelines can integrate LibContinual as a plug‑in test stage, automatically validating that new continual‑learning models meet online‑learning and memory‑budget criteria before promotion.
  • Framework developers (e.g., PyTorch Lightning, TensorFlow) gain a reference implementation for standardizing CL APIs, which could evolve into a community‑wide extension.
  • Research‑to‑product translation becomes faster: teams can prototype a CL method, swap it with any of the 19 built‑in algorithms, and instantly see how it behaves under realistic constraints, informing design decisions early.

Limitations & Future Work

  • Scope of tasks: The benchmark currently focuses on image classification (CIFAR‑100, TinyImageNet). Extending to NLP, reinforcement learning, or multimodal streams is left for future releases.
  • Hardware diversity: Experiments were run on a single GPU class; the impact of heterogeneous edge devices (CPU‑only, low‑power ASICs) is not quantified.
  • Memory budget granularity: A single global cap is a useful abstraction, but real systems may have tiered storage (RAM vs. flash) that requires more nuanced budgeting strategies.
  • Algorithm coverage: While 19 methods are substantial, emerging paradigms such as meta‑continual learning or neuromorphic spiking networks are not yet integrated.

The authors plan to broaden dataset coverage, add plug‑in support for hardware‑aware budgeting, and open a community leaderboard to keep the library aligned with real‑world deployment needs.

Authors

  • Wenbin Li
  • Shangge Liu
  • Borui Kang
  • Yiyang Chen
  • KaXuan Lew
  • Yang Chen
  • Yinghuan Shi
  • Lei Wang
  • Yang Gao
  • Jiebo Luo

Paper Information

  • arXiv ID: 2512.22029v1
  • Categories: cs.LG, cs.AI
  • Published: December 26, 2025