[Paper] Merge and Bound: Direct Manipulations on Weights for Class Incremental Learning

Published: November 26, 2025 at 10:24 AM EST
4 min read

Source: arXiv - 2511.21490v1

Overview

The paper introduces Merge‑and‑Bound (M & B), a new training recipe for class‑incremental learning (CIL) that works directly on the model’s weights instead of tweaking loss functions or network architectures. By carefully merging and bounding weight updates, the method dramatically cuts down catastrophic forgetting while staying compatible with any existing CIL pipeline.

Key Contributions

  • Weight‑space merging: Two novel merging operations – inter‑task (averaging across all previously learned tasks) and intra‑task (combining multiple checkpoints within the current task) – that reshape the model without architectural changes.
  • Bounded update rule: A principled constraint that forces the new model to stay close to the merged “reference” weights, minimizing cumulative drift and preserving prior knowledge.
  • Plug‑and‑play design: M & B can be dropped into any CIL method (e.g., iCaRL, LUCIR, PODNet) without altering the loss, replay buffer, or network head.
  • State‑of‑the‑art results: Consistently outperforms recent CIL baselines on CIFAR‑100, ImageNet‑Subset, and TinyImageNet, often by 2–5 percentage points of absolute accuracy.
  • Comprehensive analysis: Ablation studies that isolate the impact of each merging component and demonstrate robustness to different replay sizes and task orders.

Methodology

  1. Inter‑task weight merging – After finishing task t‑1, the algorithm stores the model’s parameters. When task t begins, it computes a simple average of the stored per‑task checkpoints together with the current weights. This “global” weight vector serves as a knowledge anchor that embodies what the network has learned so far.

  2. Intra‑task weight merging – During training on task t, several intermediate snapshots (e.g., after each epoch) are collected. These are merged (again by averaging) to produce a task‑specific representation that smooths out noisy updates.

  3. Bounded update – The actual optimization step is constrained by a quadratic penalty that limits the distance between the updated weights and the merged anchor. Concretely, the loss becomes:

    \[ \mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CIL}} + \lambda \,\lVert \theta - \theta_{\text{merged}} \rVert_2^2, \]

    where \( \theta \) are the current parameters and \( \theta_{\text{merged}} \) is the result of the two merges. The hyper‑parameter \( \lambda \) controls how “tight” the bound is.

  4. Integration – Because the extra term is just a regularizer on the weight vector, it can be added to any existing CIL loss (cross‑entropy, distillation, contrastive, etc.) without touching the model’s architecture or replay buffer; a minimal sketch of the full procedure follows this list.
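
As a rough illustration of steps 1–3, here is a minimal PyTorch-style sketch of the plain averaging used for both merges and of the quadratic bound. The names `merge_state_dicts`, `bound_penalty`, and `lambda_bound` are illustrative choices for this sketch, not the authors’ implementation.

```python
import copy
import torch

def merge_state_dicts(state_dicts):
    """Parameter-wise plain average of a list of state dicts (used for both inter- and intra-task merges)."""
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        if torch.is_floating_point(merged[key]):
            merged[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
    return merged

def bound_penalty(model, merged_state, lambda_bound):
    """Quadratic penalty lambda * ||theta - theta_merged||_2^2 keeping the weights near the merged anchor."""
    penalty = 0.0
    for name, param in model.named_parameters():
        anchor = merged_state[name].detach().to(param.device)
        penalty = penalty + (param - anchor).pow(2).sum()
    return lambda_bound * penalty
```

The penalty is just another differentiable term added to whatever loss the underlying CIL method already computes, which is what makes the bounded update loss-agnostic.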

Results & Findings

| Dataset | Baseline (e.g., LUCIR) | LUCIR + M & B | Gain |
| --- | --- | --- | --- |
| CIFAR‑100 (20 tasks) | 63.2 % | 68.1 % | +4.9 % |
| ImageNet‑Subset (10 tasks) | 71.5 % | 74.8 % | +3.3 % |
| TinyImageNet (10 tasks) | 55.0 % | 58.9 % | +3.9 % |

  • Reduced forgetting: The average drop in accuracy for the first task after learning all tasks shrank from ~30 % to ~18 % when M & B was applied.
  • Stability across replay sizes: Even with a tiny replay buffer (1 % of the dataset), M & B still delivered a >3 % boost, showing that the weight‑space regularizer does not rely on large exemplars.
  • Ablation: Removing intra‑task merging caused a ~1.2 % drop; removing the bounded term caused a ~2.5 % drop, confirming that both components are essential.

Overall, the experiments demonstrate that staying “close” to a merged weight representation is an effective, low‑overhead way to keep old knowledge alive.

Practical Implications

  • Easy adoption: Developers can add a few lines of code (store checkpoints, compute an average, add the regularizer) to any CIL framework they already use, as sketched after this list. No new layers, memory‑intensive rehearsal, or custom optimizers are required.
  • Lower compute & memory footprint: Because the method works on the parameter vector itself, it avoids expensive generative replay or large exemplar buffers, making it attractive for edge devices or on‑device continual learning.
  • Robustness to task order: The merging strategy is agnostic to how tasks are presented, which is valuable for real‑world pipelines where data arrives in unpredictable sequences (e.g., incremental product catalogs, evolving sensor modalities).
  • Potential beyond CIL: The bounded‑update idea could be repurposed for other continual‑learning scenarios such as domain adaptation, federated learning, or even fine‑tuning large language models where preserving a “core” representation is critical.
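
To make the “few lines of code” claim concrete, below is a hedged sketch of what adoption might look like in a PyTorch-style training loop. It reuses the illustrative `merge_state_dicts` and `bound_penalty` helpers from the Methodology sketch; the toy model, `cil_loss_fn`, `lambda_bound`, and the snapshot bookkeeping are stand-in names for this sketch, not the authors’ code.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-ins so the sketch is self-contained; a real pipeline supplies its own
# model, optimizer, data loader, and CIL loss (cross-entropy + distillation, etc.).
model = nn.Linear(8, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
cil_loss_fn = nn.CrossEntropyLoss()                 # stands in for the existing CIL loss
lambda_bound = 1e-3                                 # assumed hyper-parameter name and value
task_loader = [(torch.randn(4, 8), torch.randint(0, 10, (4,))) for _ in range(3)]

task_checkpoints = []   # one stored state dict per completed task (inter-task merge)
epoch_snapshots = []    # intermediate snapshots within the current task (intra-task merge)

for epoch in range(2):
    # Merge past-task checkpoints, intra-task snapshots, and the current weights into one anchor.
    anchor = merge_state_dicts(
        task_checkpoints + epoch_snapshots + [copy.deepcopy(model.state_dict())]
    )
    for images, labels in task_loader:
        loss = cil_loss_fn(model(images), labels)
        loss = loss + bound_penalty(model, anchor, lambda_bound)  # the only added term
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    epoch_snapshots.append(copy.deepcopy(model.state_dict()))     # collect an intra-task snapshot

task_checkpoints.append(copy.deepcopy(model.state_dict()))        # checkpoint stored after the task
```

In a real pipeline, only the anchor computation and the added penalty term touch existing code; the loss, replay buffer, and network head stay exactly as the underlying CIL method defines them.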

Limitations & Future Work

  • Merging simplicity: The current approach uses plain averaging; more sophisticated merging (e.g., weighted averages based on task difficulty or confidence) might yield further gains.
  • Scalability to very large models: Storing full checkpoints for every task can become memory‑intensive for models with hundreds of millions of parameters; the authors suggest exploring low‑rank or sketch‑based representations.
  • Theoretical guarantees: While empirical results are strong, a formal analysis of why the bounded update mitigates forgetting is left for future research.
  • Extension to non‑classification tasks: The paper focuses on image classification; applying M & B to detection, segmentation, or multimodal tasks remains an open avenue.

Authors

  • Taehoon Kim
  • Donghwan Jang
  • Bohyung Han

Paper Information

  • arXiv ID: 2511.21490v1
  • Categories: cs.CV, cs.AI, cs.LG
  • Published: November 26, 2025
  • PDF: Download PDF