[Paper] TSN-Affinity: Similarity-Driven Parameter Reuse for Continual Offline Reinforcement Learning

Published: April 28, 2026 at 01:41 PM EDT
5 min read
Source: arXiv - 2604.25898v1

Overview

Continual Offline Reinforcement Learning (CORL) tackles the challenge of training a single agent on a stream of tasks without any live interaction—think of updating a robot’s skill set from batches of logged data while still keeping its old abilities intact. The paper TSN‑Affinity introduces a fresh architectural approach that sidesteps the heavy memory and distribution‑shift problems of replay‑based methods, using tiny, task‑specific subnetworks and a similarity‑driven routing scheme to share knowledge only when it makes sense.

Key Contributions

  • TinySubNetwork (TSN) architecture for CORL – each new task gets a lightweight “subnetwork” that reuses a subset of the base model’s parameters.
  • Affinity‑based routing – a novel RL‑aware similarity metric (action compatibility + latent embedding similarity) decides which subnetwork should handle a given state, enabling controlled parameter sharing.
  • Integration with Decision Transformers – leverages the sequence‑modeling strengths of Transformers for offline RL while keeping the TSN overhead minimal.
  • Comprehensive empirical evaluation – experiments on Atari (discrete) and Franka Emika Panda manipulation (continuous) demonstrate superior retention and multi‑task performance compared to replay baselines.
  • Open‑source implementation – code released for reproducibility and community extensions.

Methodology

  1. Base Model: A standard Decision Transformer (DT) processes trajectories as token sequences (state, action, return‑to‑go).
  2. TinySubNetworks: For each incoming task, a binary mask is learned that activates only a small fraction of the DT’s weights, forming a task‑specific subnetwork. The rest of the parameters stay shared across tasks.
  3. Affinity Scoring:
    • Action Compatibility: measures how similar the action distributions of two tasks are (e.g., both require “move left”).
    • Latent Similarity: computes cosine similarity between the hidden representations of states from different tasks.
    The combined score determines whether a new task can reuse an existing subnetwork or should spawn a fresh one (a minimal code sketch follows this list).
  4. Routing at Inference: When the agent receives a state, it evaluates affinity scores against all existing subnetworks and selects the one with the highest compatibility, effectively “routing” the decision through the most relevant parameter set.
  5. Training Loop: Offline datasets are processed sequentially. For each task, only its designated subnetwork is updated, while shared weights receive gradients from all tasks, encouraging knowledge transfer without overwriting task‑specific nuances (see the masked‑update sketch below).
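
To make steps 3 and 4 concrete, here is a minimal NumPy sketch of affinity scoring and routing. The function names, the 50/50 weighting between the two terms, and the 0.8 reuse threshold are illustrative assumptions, not values from the paper:

```python
import numpy as np

def action_compatibility(p, q):
    """Similarity of two tasks' action distributions: 1 minus total variation distance."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return 1.0 - 0.5 * np.abs(p - q).sum()

def latent_similarity(h_a, h_b, eps=1e-8):
    """Cosine similarity between mean hidden-state embeddings of two tasks."""
    h_a, h_b = np.asarray(h_a, float), np.asarray(h_b, float)
    return float(h_a @ h_b / (np.linalg.norm(h_a) * np.linalg.norm(h_b) + eps))

def affinity(task_a, task_b, w=0.5):
    """Combined score; the weighting w is an assumption, not from the paper."""
    return (w * action_compatibility(task_a["action_dist"], task_b["action_dist"])
            + (1 - w) * latent_similarity(task_a["embedding"], task_b["embedding"]))

def route(new_task, subnetworks, threshold=0.8):
    """Reuse the best-matching subnetwork if its affinity clears the threshold;
    return None to signal that a fresh subnetwork should be spawned."""
    scores = [affinity(new_task, s) for s in subnetworks]
    if not scores:
        return None
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None
```

At inference, the same routing logic selects the highest-affinity parameter set for an incoming state, as in step 4 above.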

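Step 5's masked update can be sketched in PyTorch by gating each task-specific gradient with the task's binary mask. The mask-as-dict representation, the treatment of unmasked tensors as shared, and the plain SGD step with a fixed learning rate are all assumptions for illustration:

```python
import torch

def masked_update(model, loss, task_mask, lr=1e-4):
    """One training step: gradients on task-specific tensors are zeroed outside
    the task's binary mask; tensors absent from the mask are treated as shared."""
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.grad is None:
                continue
            if name in task_mask:              # task-specific tensor
                p.grad.mul_(task_mask[name])   # task_mask[name]: 0/1 tensor, same shape as p
            p.add_(p.grad, alpha=-lr)          # plain SGD step (illustrative)
```

Running this per task over the sequential offline datasets reproduces the training loop in spirit; the paper's actual optimizer and mask-learning procedure may differ.
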
Results & Findings

Benchmark                 | Replay‑CL (baseline) | TSN‑Affinity (ours) | Retention (Δ after 5 tasks)
Atari (10 games)          | 78 % avg. score      | 84 % avg. score     | +12 %
Franka Panda (pick‑place) | 0.62 success rate    | 0.71 success rate   | +15 %
  • Retention: After learning five tasks, TSN‑Affinity loses <5 % of performance on earlier tasks, whereas replay methods drop >15 %.
  • Parameter Efficiency: Each subnetwork uses ~8 % of the full model’s parameters; total memory grows linearly but remains modest (≈1.4× the base DT after ten tasks; a back‑of‑envelope check follows this list).
  • Routing Gains: Adding the affinity‑based router improves multi‑task scores by ~4 % over a naïve “first‑match” subnetwork selection.
  • Training Speed: Because only a sparse mask is updated per task, per‑task training time drops ~30 % compared to full‑model fine‑tuning.
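
As a back-of-envelope check on the memory figure (an inference from the reported numbers, not a calculation in the paper): ten disjoint 8 % masks would add 0.8× the base parameters, so the reported ≈1.4× total suggests that roughly half of each mask's weights overlap with previously allocated ones:

```python
base = 1.0        # base Decision Transformer parameters (normalized)
per_task = 0.08   # ~8 % of the full model per subnetwork
tasks = 10
overlap = 0.5     # assumed fraction of each mask reusing earlier weights
total = base + tasks * per_task * (1 - overlap)
print(total)      # -> ~1.4, matching the reported footprint
```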

Practical Implications

  • Robotics & Edge Devices: Companies can continuously upgrade a robot’s repertoire from logged sensor data without pulling it out of production, while keeping the firmware footprint low.
  • Safety‑Critical Systems: In domains like autonomous driving, where online exploration is risky, TSN‑Affinity enables incremental policy updates from simulation or fleet data without catastrophic forgetting.
  • Resource‑Constrained Cloud Services: SaaS platforms offering RL‑as‑a‑service can host many client‑specific policies in a single model, reducing GPU memory and inference latency through subnetwork routing.
  • Simplified Deployment Pipelines: No need to maintain large replay buffers or perform costly data shuffling; new tasks are added by training a tiny mask and updating the shared backbone.

Limitations & Future Work

  • Scalability of Affinity Computation: As the number of tasks grows, evaluating similarity against all existing subnetworks may become a bottleneck; approximate nearest‑neighbor search is a possible remedy (a lookup sketch follows this list).
  • Task Similarity Assumption: The routing relies on meaningful latent similarity; highly divergent tasks (e.g., vision‑based navigation vs. pure control) may still require separate large subnetworks, limiting parameter sharing.
  • Offline Dataset Quality: Like any offline RL method, performance hinges on the coverage and quality of the logged trajectories; noisy or biased logs can degrade the affinity scores.
  • Future Directions: Extending TSN‑Affinity to meta‑learning scenarios where the model can quickly infer a new mask from a few demonstrations, and exploring hierarchical routing (grouping tasks into clusters) to keep affinity checks tractable.
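
The scalability point above could be addressed with a sublinear lookup over task embeddings. Below is a minimal stand-in using scikit-learn's NearestNeighbors (a true approximate-nearest-neighbor library such as FAISS would be the production choice); the embedding dimensions and normalization scheme are assumptions:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# One embedding per existing subnetwork (dimensions are illustrative).
task_embeddings = np.random.randn(200, 64)
task_embeddings /= np.linalg.norm(task_embeddings, axis=1, keepdims=True)

# Tree-based index gives sublinear queries versus scoring every subnetwork.
index = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(task_embeddings)

def nearest_subnetwork(query_embedding):
    """Index of the most similar existing subnetwork (cosine via normalized L2)."""
    q = query_embedding / np.linalg.norm(query_embedding)
    _, idx = index.kneighbors(q.reshape(1, -1))
    return int(idx[0, 0])
```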

If you’re interested in trying out TSN‑Affinity, the authors have made the code publicly available on GitHub. The approach opens a promising path toward truly continual, offline‑learning agents that can evolve safely and efficiently in real‑world deployments.

Authors

  • Dominik Żurek
  • Kamil Faber
  • Marcin Pietron
  • Paweł Gajewski
  • Roberto Corizzo

Paper Information

  • arXiv ID: 2604.25898v1
  • Categories: cs.LG, cs.AI
  • Published: April 28, 2026