[Paper] Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data

Published: January 29, 2026 at 01:56 PM EST
4 min read
Source: arXiv - 2601.22141v1

Overview

The paper introduces Routing the Lottery (RTL), a new pruning framework that moves beyond the classic “one‑size‑fits‑all” lottery ticket hypothesis. Instead of searching for a single sparse subnetwork that works for every input, RTL learns a portfolio of adaptive tickets—each specialized for a particular class, semantic cluster, or environmental condition. The result is a modular, context‑aware model that delivers higher accuracy with dramatically fewer parameters.

Key Contributions

  • Adaptive tickets: A method to discover multiple, data‑dependent sparse subnetworks rather than a universal one.
  • Routing mechanism: A lightweight selector that routes each input to its most suitable ticket at inference time.
  • Subnetwork collapse analysis: Identification of a failure mode where aggressive pruning causes tickets to lose discriminative power.
  • Subnetwork similarity score: A label‑free metric that flags oversparsification before performance degrades.
  • Empirical gains: Across image classification, object detection, and domain‑shift benchmarks, RTL achieves up to 10× parameter reduction compared with training separate models, while improving balanced accuracy and recall.

Methodology

  1. Base network & initial pruning: Start with a dense backbone (e.g., ResNet‑50) and apply magnitude‑based pruning to obtain an initial sparse mask.
  2. Ticket diversification: A small clustering step on either class labels or learned feature embeddings splits the data into K groups (e.g., per class or per domain). For each group, RTL fine‑tunes a separate mask while keeping the shared backbone weights frozen. This yields K adaptive tickets that differ mainly in which connections are kept.
  3. Routing module: A shallow gating network (often a single linear layer followed by a softmax) takes the same input as the backbone and predicts which ticket should process it. The routing decision is trained jointly with the tickets using a cross‑entropy loss plus a sparsity regularizer (a minimal sketch of steps 2–3 appears after this list).
  4. Training loop:
    • Forward pass → routing → selected ticket → loss.
    • Back‑prop updates both the routing parameters and the mask scores for the active ticket.
    • Periodically, masks are binarized (0/1) based on a global sparsity budget (see the training‑loop sketch below).
  5. Diagnostic tools: The subnetwork similarity score computes the pairwise overlap of the binary masks; a sudden drop signals subnetwork collapse, prompting a relaxation of the sparsity target.
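
To make steps 2–3 concrete, here is a minimal PyTorch‑style sketch of a frozen shared backbone, per‑ticket mask scores, and a single‑layer router. The module structure, the names, and the top‑k binarization with a straight‑through estimator are illustrative assumptions rather than the paper's released implementation; a real setup would mask every weight tensor of a backbone such as ResNet‑50 instead of two toy linear layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def binarize(scores, sparsity):
    # Keep the top (1 - sparsity) fraction of connections by score, zero the rest.
    k = max(1, int(scores.numel() * (1.0 - sparsity)))
    thresh = scores.flatten().topk(k).values.min()
    hard = (scores >= thresh).float()
    # Straight-through estimator so gradients can still reach the scores during training.
    return hard + scores - scores.detach()

class RoutedTickets(nn.Module):
    def __init__(self, in_dim=128, hidden=256, num_classes=10, num_tickets=5):
        super().__init__()
        # Shared dense backbone weights, kept frozen while tickets are learned (step 2).
        self.w1 = nn.Parameter(torch.randn(hidden, in_dim) * 0.02, requires_grad=False)
        self.w2 = nn.Parameter(torch.randn(num_classes, hidden) * 0.02, requires_grad=False)
        # One real-valued score tensor per ticket and per weight matrix;
        # binarizing them yields the K adaptive sparse masks.
        self.scores1 = nn.Parameter(torch.rand(num_tickets, hidden, in_dim))
        self.scores2 = nn.Parameter(torch.rand(num_tickets, num_classes, hidden))
        # Lightweight router: a single linear layer; its argmax picks a ticket (step 3).
        self.router = nn.Linear(in_dim, num_tickets)

    def forward(self, x, sparsity=0.9):
        route_logits = self.router(x)           # (batch, K) ticket scores
        tickets = route_logits.argmax(dim=-1)   # hard routing: one ticket per input
        outs = []
        for i, k in enumerate(tickets.tolist()):
            m1 = binarize(self.scores1[k], sparsity)
            m2 = binarize(self.scores2[k], sparsity)
            h = F.relu(F.linear(x[i:i + 1], self.w1 * m1))   # masked backbone layer 1
            outs.append(F.linear(h, self.w2 * m2))           # masked backbone layer 2
        return torch.cat(outs, dim=0), route_logits
```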

The whole pipeline is compatible with standard deep‑learning libraries and adds only a modest overhead (the routing net is <1 % of total FLOPs).
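
Continuing that sketch, a possible training loop for step 4 is shown below. It assumes the data loader also yields the cluster id from step 2 so the router can be supervised with a cross‑entropy term; the actual routing objective, loss weights, and re‑binarization schedule in the paper may differ (here the masks are re‑binarized on every forward pass).

```python
# Continues the RoutedTickets sketch above; `loader` is a hypothetical DataLoader
# yielding (inputs, labels, group_id), with group_id the cluster assignment from step 2.
model = RoutedTickets()
optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)

for x, y, group_id in loader:
    logits, route_logits = model(x)                        # forward pass -> routing -> selected ticket
    task_loss = F.cross_entropy(logits, y)                 # loss on the routed prediction
    route_loss = F.cross_entropy(route_logits, group_id)   # teach the router the data clusters
    # Sparsity regularizer over all mask scores (illustrative weighting).
    sparsity_reg = model.scores1.abs().mean() + model.scores2.abs().mean()
    loss = task_loss + route_loss + 1e-4 * sparsity_reg
    optimizer.zero_grad()
    loss.backward()   # task gradients reach only the mask scores of the selected tickets
    optimizer.step()
```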

Results & Findings

| Dataset / Task | Baseline (single ticket) | RTL (K=5) | Parameter Savings |
| --- | --- | --- | --- |
| CIFAR‑100 (classification) | 73.2 % acc | 77.8 % acc | 9.3× fewer params |
| Cityscapes (semantic seg.) | 71.5 % mIoU | 74.2 % mIoU | 7.8× fewer params |
| DomainNet (multi‑domain) | 62.1 % avg acc | 66.4 % avg acc | 10.2× fewer params |

  • Balanced accuracy improves especially on under‑represented classes, indicating that tickets specialize to capture minority patterns.
  • Recall gains are consistent across tasks, showing that RTL reduces false negatives caused by over‑pruning.
  • The subnetwork similarity score successfully predicts collapse: when the score falls below a learned threshold, early‑stopping or sparsity relaxation restores performance.
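
One plausible form of such a score, again continuing the sketch from the Methodology section, is the mean pairwise Jaccard overlap (IoU) of the binarized ticket masks. The overlap measure and the threshold below are illustrative assumptions, not values from the paper.

```python
# Continues the RoutedTickets sketch; Jaccard overlap and the numbers are illustrative.
@torch.no_grad()
def subnetwork_similarity(model, sparsity=0.9):
    num_tickets = model.scores1.shape[0]
    # Flatten each ticket's binary masks into one long 0/1 vector.
    masks = [
        torch.cat([binarize(model.scores1[k], sparsity).flatten(),
                   binarize(model.scores2[k], sparsity).flatten()])
        for k in range(num_tickets)
    ]
    pair_scores = []
    for a in range(num_tickets):
        for b in range(a + 1, num_tickets):
            inter = (masks[a] * masks[b]).sum()
            union = ((masks[a] + masks[b]) > 0).float().sum().clamp(min=1)
            pair_scores.append((inter / union).item())   # Jaccard overlap of two tickets
    return sum(pair_scores) / len(pair_scores)

# If the score drops below a (tuned) threshold, relax the global sparsity target
# or stop pruning early, as described above. Both numbers here are placeholders.
if subnetwork_similarity(model) < 0.3:
    sparsity_target = 0.8   # back off from a more aggressive budget such as 0.9
```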

Practical Implications

  • Edge & mobile deployment: Developers can ship a single compact model that dynamically activates the appropriate ticket, avoiding the storage and maintenance cost of multiple specialized models.
  • Continual learning & domain adaptation: New tickets can be added for emerging data clusters without retraining the entire network, facilitating modular updates.
  • Interpretability: Since tickets align with semantic groups, engineers can inspect which parts of the network are responsible for specific classes or conditions, aiding debugging and fairness audits.
  • Resource‑aware inference: The routing decision can be conditioned on device constraints (e.g., low‑power mode) to select a lighter ticket, offering graceful degradation.
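
As a toy illustration of that last point, a deployment‑side wrapper could exclude tickets whose parameter counts exceed the current device budget before taking the routing argmax. The function and its inputs below are hypothetical and not part of the paper.

```python
# Hypothetical resource-aware selection: restrict routing to tickets that fit
# the current device budget, then pick the best remaining one by gate score.
def route_with_budget(route_logits, ticket_param_counts, budget):
    allowed = torch.tensor([count <= budget for count in ticket_param_counts])
    feasible = route_logits.masked_fill(~allowed, float("-inf"))  # drop over-budget tickets
    return feasible.argmax(dim=-1)   # selected ticket index per input under the budget
```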

Limitations & Future Work

  • Routing overhead: Although small, the routing network adds latency; scaling to thousands of tickets may require more efficient selectors.
  • Cluster definition: RTL relies on a reasonable grouping of data; poor clustering can lead to redundant tickets or suboptimal specialization.
  • Training stability: Joint optimization of masks and routing can be sensitive to hyper‑parameters, especially the sparsity schedule.
  • Future directions: The authors suggest exploring hierarchical routing (coarse‑to‑fine ticket selection), integrating RTL with neural architecture search, and extending the similarity diagnostics to unsupervised settings.

Routing the Lottery reframes pruning from a static compression technique into a dynamic, data‑aware strategy—opening the door for more modular, efficient, and adaptable deep‑learning systems in production environments.

Authors

  • Grzegorz Stefanski
  • Alberto Presta
  • Michal Byra

Paper Information

  • arXiv ID: 2601.22141v1
  • Categories: cs.AI, cs.CV, cs.LG
  • Published: January 29, 2026
  • PDF: https://arxiv.org/pdf/2601.22141v1