[Paper] Routing the Lottery: Adaptive Subnetworks for Heterogeneous Data
Source: arXiv - 2601.22141v1
Overview
The paper introduces Routing the Lottery (RTL), a new pruning framework that moves beyond the classic “one‑size‑fits‑all” lottery ticket hypothesis. Instead of searching for a single sparse subnetwork that works for every input, RTL learns a portfolio of adaptive tickets—each specialized for a particular class, semantic cluster, or environmental condition. The result is a modular, context‑aware model that delivers higher accuracy with dramatically fewer parameters.
Key Contributions
- Adaptive tickets: A method to discover multiple, data‑dependent sparse subnetworks rather than a universal one.
- Routing mechanism: A lightweight selector that routes each input to its most suitable ticket at inference time.
- Subnetwork collapse analysis: Identification of a failure mode where aggressive pruning causes tickets to lose discriminative power.
- Subnetwork similarity score: A label‑free metric that flags oversparsification before performance degrades.
- Empirical gains: Across image classification, semantic segmentation, and domain‑shift benchmarks, RTL achieves up to 10× parameter reduction compared with training separate models, while improving balanced accuracy and recall.
Methodology
- Base network & initial pruning: Start with a dense backbone (e.g., ResNet‑50) and apply magnitude‑based pruning to obtain an initial sparse mask.
- Ticket diversification: A small clustering step over class labels or learned feature embeddings splits the data into K groups (e.g., per class or per domain). For each group, RTL fine‑tunes a separate mask while keeping the shared backbone weights frozen, yielding K adaptive tickets that differ mainly in which connections are kept.
- Routing module: A shallow gating network (often a single linear layer followed by softmax) takes the same input as the backbone and predicts which ticket should process it. The routing decision is trained jointly with the tickets using a cross‑entropy loss plus a sparsity regularizer.
- Training loop (a minimal code sketch follows this list):
- Forward pass → routing → selected ticket → loss.
- Back‑prop updates both the routing parameters and the mask scores for the active ticket.
- Periodically, masks are binarized (0/1) based on a global sparsity budget.
- Diagnosis tools: The subnetwork similarity score computes pairwise overlap of binary masks; a sudden drop signals subnetwork collapse, prompting a relaxation of the sparsity target (a minimal version of this check is sketched below).
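The steps above can be pictured with a short PyTorch‑style sketch. This is a minimal illustration under assumptions (a frozen linear task head standing in for the backbone's pruned layers, group labels available as routing targets, straight‑through gradients for the binary masks); the names, sizes, and loss weights here are invented for illustration and are not the authors' reference implementation.

```python
# Minimal RTL-style sketch (illustrative, not the paper's code): a shared frozen
# weight tensor, K learnable mask-score tensors ("tickets"), and a single-layer
# router that picks a ticket per input.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoutedTickets(nn.Module):
    def __init__(self, feat_dim=512, num_classes=100, num_tickets=5, sparsity=0.9):
        super().__init__()
        self.sparsity = sparsity
        # Shared backbone/head weights stay frozen; only the masks differ per ticket.
        self.weight = nn.Parameter(0.01 * torch.randn(num_classes, feat_dim),
                                   requires_grad=False)
        # Real-valued mask scores, one tensor per ticket (ticket diversification).
        self.mask_scores = nn.Parameter(torch.randn(num_tickets, num_classes, feat_dim))
        # Routing module: one linear layer over input features, softmax over tickets.
        self.router = nn.Linear(feat_dim, num_tickets)

    def binary_masks(self):
        # Binarize scores against a global sparsity budget: keep the top
        # (1 - sparsity) fraction of weights in each ticket.
        keep = max(int((1.0 - self.sparsity) * self.mask_scores[0].numel()), 1)
        flat = self.mask_scores.flatten(1)
        thresh = flat.topk(keep, dim=1).values[:, -1:]
        return (flat >= thresh).float().view_as(self.mask_scores)

    def forward(self, feats):
        # feats: backbone features of shape (batch, feat_dim).
        route_logits = self.router(feats)          # ticket scores per input
        ticket = route_logits.argmax(dim=1)        # hard ticket selection
        masks = self.binary_masks()
        # Straight-through estimator so the mask scores still receive gradients.
        masks = masks + self.mask_scores - self.mask_scores.detach()
        w = masks[ticket] * self.weight            # per-input masked weights
        logits = torch.einsum("bcd,bd->bc", w, feats)
        return logits, route_logits, ticket


def train_step(model, optimizer, feats, labels, group_ids,
               route_weight=0.1, l1_weight=1e-4):
    # Forward pass -> routing -> selected ticket -> loss; the router is trained
    # with a cross-entropy loss against the data-group assignment, plus an L1
    # sparsity regularizer on the mask scores.
    logits, route_logits, _ = model(feats)
    loss = (F.cross_entropy(logits, labels)
            + route_weight * F.cross_entropy(route_logits, group_ids)
            + l1_weight * model.mask_scores.abs().mean())
    optimizer.zero_grad()
    loss.backward()        # updates both the router and the active tickets' mask scores
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = RoutedTickets()
    opt = torch.optim.Adam([model.mask_scores, *model.router.parameters()], lr=1e-3)
    feats = torch.randn(8, 512)                    # stand-in for backbone features
    labels = torch.randint(0, 100, (8,))
    groups = torch.randint(0, 5, (8,))             # cluster / domain assignments
    print(train_step(model, opt, feats, labels, groups))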
The whole pipeline is compatible with standard deep‑learning libraries and adds only a modest overhead (the routing net is <1 % of total FLOPs).
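The similarity diagnostic from the last bullet can be sketched, for example, as the mean pairwise Jaccard (IoU) overlap of the binary ticket masks; the exact overlap measure and threshold are assumptions here, and `maybe_relax_sparsity` simply reuses the `RoutedTickets` model from the sketch above.

```python
# Hypothetical subnetwork similarity score: mean pairwise Jaccard overlap of the
# binary ticket masks. A sudden drop is treated as a warning sign of subnetwork
# collapse and triggers a relaxation of the sparsity target.
import torch


def subnetwork_similarity(masks: torch.Tensor) -> float:
    """masks: (num_tickets, ...) binary tensor of kept (1) vs. pruned (0) weights."""
    flat = masks.flatten(1).bool()
    overlaps = []
    for i in range(flat.shape[0]):
        for j in range(i + 1, flat.shape[0]):
            inter = (flat[i] & flat[j]).sum().item()
            union = (flat[i] | flat[j]).sum().item()
            overlaps.append(inter / max(union, 1))
    return sum(overlaps) / max(len(overlaps), 1)


def maybe_relax_sparsity(model, threshold=0.2, step=0.05):
    # Label-free check: if the tickets barely overlap any more, back off the budget.
    score = subnetwork_similarity(model.binary_masks())
    if score < threshold:
        model.sparsity = max(model.sparsity - step, 0.0)
    return score
```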
Results & Findings
| Dataset / Task | Baseline (single ticket) | RTL (K=5) | Parameter Savings (vs. separate models) |
|---|---|---|---|
| CIFAR‑100 (classification) | 73.2 % acc | 77.8 % acc | 9.3× fewer params |
| Cityscapes (semantic seg.) | 71.5 % mIoU | 74.2 % mIoU | 7.8× fewer params |
| DomainNet (multi‑domain) | 62.1 % avg acc | 66.4 % avg acc | 10.2× fewer params |
- Balanced accuracy improves especially on under‑represented classes, indicating that tickets specialize to capture minority patterns.
- Recall gains are consistent across tasks, showing that RTL reduces false negatives caused by over‑pruning.
- The subnetwork similarity score successfully predicts collapse: when the score falls below a learned threshold, early‑stopping or sparsity relaxation restores performance.
Practical Implications
- Edge & mobile deployment: Developers can ship a single compact model that dynamically activates the appropriate ticket, avoiding the storage and maintenance cost of multiple specialized models.
- Continual learning & domain adaptation: New tickets can be added for emerging data clusters without retraining the entire network, facilitating modular updates.
- Interpretability: Since tickets align with semantic groups, engineers can inspect which parts of the network are responsible for specific classes or conditions, aiding debugging and fairness audits.
- Resource‑aware inference: The routing decision can be conditioned on device constraints (e.g., low‑power mode) to select a lighter ticket, offering graceful degradation (a small sketch follows this list).
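As an illustration of the last point, the routing decision could be restricted to a whitelist of lighter tickets whenever the device reports a low‑power mode; the logit‑masking scheme below is a hypothetical sketch, not something specified in the paper.

```python
# Hypothetical resource-aware routing: in low-power mode, mask out the logits of
# all tickets except a designated set of lighter ones before taking the argmax.
import torch


def select_ticket(route_logits: torch.Tensor, low_power: bool,
                  light_tickets=(0, 1)) -> torch.Tensor:
    if low_power:
        allowed = torch.full_like(route_logits, float("-inf"))
        allowed[:, list(light_tickets)] = 0.0     # only these tickets stay selectable
        route_logits = route_logits + allowed
    return route_logits.argmax(dim=1)


# Example: logits for a batch of 4 inputs over 5 tickets, constrained routing.
# tickets = select_ticket(torch.randn(4, 5), low_power=True)
```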
Limitations & Future Work
- Routing overhead: Although small, the routing network adds latency; scaling to thousands of tickets may require more efficient selectors.
- Cluster definition: RTL relies on a reasonable grouping of data; poor clustering can lead to redundant tickets or suboptimal specialization.
- Training stability: Joint optimization of masks and routing can be sensitive to hyper‑parameters, especially the sparsity schedule.
- Future directions: The authors suggest exploring hierarchical routing (coarse‑to‑fine ticket selection), integrating RTL with neural architecture search, and extending the similarity diagnostics to unsupervised settings.
Routing the Lottery reframes pruning from a static compression technique into a dynamic, data‑aware strategy—opening the door for more modular, efficient, and adaptable deep‑learning systems in production environments.
Authors
- Grzegorz Stefanski
- Alberto Presta
- Michal Byra
Paper Information
- arXiv ID: 2601.22141v1
- Categories: cs.AI, cs.CV, cs.LG
- Published: January 29, 2026