[Paper] Winning the Lottery by Preserving Network Training Dynamics with Concrete Ticket Search

Published: December 7, 2025 at 10:48 PM EST
4 min read

Source: arXiv - 2512.07142v1

Overview

The paper tackles a long‑standing bottleneck in the Lottery Ticket Hypothesis: finding ultra‑sparse, high‑performing subnetworks (the “winning tickets”) without the massive compute cost of rewinding training. The authors introduce Concrete Ticket Search (CTS), a combinatorial‑optimization‑based method that discovers winning tickets near initialization, delivering lottery‑ticket‑level accuracy in a fraction of the time.

Key Contributions

  • Concrete Ticket Search (CTS): formulates subnetwork selection as a differentiable combinatorial problem using a Concrete (continuous) relaxation of binary masks.
  • GRADBALANCE: a novel gradient‑balancing scheme that automatically steers sparsity toward a target level, eliminating fragile hyper‑parameter tuning.
  • CTS‑KL objective: leverages a reverse KL‑divergence loss (inspired by knowledge distillation) to align sparse‑network outputs with those of the dense parent, dramatically improving early‑training dynamics.
  • Comprehensive empirical validation: demonstrates that CTS matches or exceeds state‑of‑the‑art Lottery Ticket Rewinding (LTR) on CIFAR‑10/100 and ImageNet‑scaled models, while cutting runtime by up to 12×.
  • Robust sanity checks: CTS‑derived tickets pass all standard sanity tests (e.g., random re‑initialization, weight‑shuffling) that expose weaknesses in many pruning‑at‑initialization (PaI) methods.

Methodology

  1. Search Space Relaxation – Each weight is associated with a continuous mask variable (m_i \in [0,1]). The binary mask (keep or prune) is approximated by a Concrete distribution, enabling gradients to flow through the mask selection process; steps 1 and 2 are sketched in code after this list.

  2. Objective Function – The primary loss combines the standard classification loss with a reverse KL term:

    \mathcal{L}_{\text{CTS‑KL}} = \mathcal{L}_{\text{CE}}(f_{\theta \odot m}(x), y) + \lambda \, \text{KL}\big(p_{\text{dense}}(x) \,\|\, p_{\text{sparse}}(x)\big)
    

    where (p_{\text{dense}}) is the softmax output of the full network and (p_{\text{sparse}}) that of the masked network.

  3. GRADBALANCE – During training, gradients of the mask variables are scaled to keep the expected sparsity close to a user‑specified target. This dynamic scaling prevents the optimizer from collapsing to either an all‑dense or all‑pruned solution; a rough illustration of the idea follows the pipeline note below.

  4. Optimization Loop – A short optimization run over a modest subset of the training data (typically just a few epochs) suffices to converge to a high‑quality mask. The final binary mask is obtained by thresholding the learned continuous masks.
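Below is a minimal PyTorch‑style sketch of steps 1 and 2, assuming a binary‑Concrete (Gumbel‑Sigmoid) relaxation of the mask. The function names, the temperature value, and the way the KL term is computed are illustrative assumptions, not the paper's reference code.

```python
import torch
import torch.nn.functional as F

def sample_concrete_mask(logits, temperature=0.5):
    """Draw a relaxed mask in [0, 1] from a binary-Concrete (Gumbel-Sigmoid)
    distribution, so gradients can flow back to the mask logits."""
    u = torch.rand_like(logits).clamp_(1e-6, 1 - 1e-6)
    logistic_noise = torch.log(u) - torch.log(1 - u)
    return torch.sigmoid((logits + logistic_noise) / temperature)

def cts_kl_loss(sparse_logits, dense_logits, targets, lam=1.0):
    """Cross-entropy on the masked network plus a KL term matching the formula
    above, KL(p_dense || p_sparse), which pulls the sparse network's
    predictions toward the dense parent's."""
    ce = F.cross_entropy(sparse_logits, targets)
    kl = F.kl_div(
        F.log_softmax(sparse_logits, dim=-1),   # log p_sparse
        F.softmax(dense_logits, dim=-1),        # p_dense
        reduction="batchmean",
    )
    return ce + lam * kl
```

In a search loop, the sampled mask would multiply the frozen dense weights element‑wise before the forward pass, and only the mask logits would receive gradient updates.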

The whole pipeline runs once per model, unlike LTR which requires multiple full training cycles with rewinding.
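Step 3's GRADBALANCE is only described at a high level above. As one hedged illustration of the idea (explicitly not the paper's actual rule), the sketch below nudges the mask‑logit gradients after backpropagation so that the expected sparsity drifts toward the user's target.

```python
import torch

def sparsity_steering_step(mask_logits, target_sparsity, strength=5.0):
    """Illustrative only -- not the paper's GRADBALANCE rule. Call after
    loss.backward() to bias the mask-logit gradients so the expected
    fraction of pruned weights moves toward `target_sparsity`."""
    with torch.no_grad():
        keep_prob = torch.sigmoid(mask_logits)            # E[mask_i]
        current_sparsity = 1.0 - keep_prob.mean()
        # Positive when the mask is still too dense, i.e. more pruning needed.
        correction = strength * (target_sparsity - current_sparsity)
        if mask_logits.grad is not None:
            # Push logits toward pruning in proportion to the sigmoid's local
            # sensitivity, on top of the task gradient already accumulated.
            mask_logits.grad.add_(correction * keep_prob * (1.0 - keep_prob))
```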

Results & Findings

| Model (Dataset)              | Target Sparsity | CTS Accuracy | LTR Accuracy | CTS Runtime*             |
|------------------------------|-----------------|--------------|--------------|--------------------------|
| ResNet‑20 (CIFAR‑10)         | 99.3 %          | 74.0 %       | 68.3 %       | 7.9 min                  |
| VGG‑16 (CIFAR‑100)           | 95 %            | 71.2 %       | 70.5 %       | 12 min vs 110 min (LTR)  |
| WideResNet‑28‑10 (CIFAR‑10)  | 98 %            | 78.1 %       | 77.4 %       | 15 min vs 180 min (LTR)  |

*Runtime measured on a single NVIDIA RTX 3090; includes mask search + one‑epoch fine‑tuning.

  • Sanity Checks: CTS masks retain performance when the dense weights are re‑initialized, confirming that the discovered structure is intrinsic to the architecture, not a by‑product of the particular initialization.
  • Sparse Regime Advantage: The performance gap between CTS and LTR widens as sparsity exceeds 95 %, highlighting CTS’s ability to capture critical inter‑weight dependencies that first‑order saliency methods miss.
  • Ablation: Removing the KL term drops accuracy by ~3 % at high sparsity, while disabling GRADBALANCE leads to unstable sparsity targets and longer convergence.

Practical Implications

  • Faster Model Compression Pipelines – Developers can now obtain lottery‑ticket‑level sparsity in minutes rather than hours, making on‑device model deployment (e.g., mobile, edge AI) far more agile.
  • Reduced Cloud Compute Costs – Since CTS requires only a tiny fraction of the training budget, organizations can compress large vision models without incurring massive GPU expenses.
  • Better Transferability – The KL‑based objective aligns sparse and dense outputs, which can be leveraged for knowledge‑distillation‑style fine‑tuning when moving from a research prototype to production.
  • Framework Integration – CTS’s reliance on standard autograd and mask‑multiplication operations means it can be wrapped as a PyTorch or TensorFlow module, fitting naturally into existing training scripts (see the sketch after this list).
  • Potential for Other Modalities – Although evaluated on image classification, the method is modality‑agnostic; it could accelerate sparsification of NLP transformers, speech models, or reinforcement‑learning agents.
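As a concrete example of the mask‑multiplication integration mentioned under Framework Integration, here is a hypothetical PyTorch wrapper. The class name, thresholding rule, and temperature are illustrative assumptions rather than an official CTS module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLinear(nn.Module):
    """Hypothetical drop-in wrapper: multiplies a frozen dense weight by a
    learnable Concrete-relaxed mask during search, and by a hard 0/1 mask
    at inference time."""

    def __init__(self, layer: nn.Linear, temperature: float = 0.5):
        super().__init__()
        self.layer = layer
        self.layer.weight.requires_grad_(False)        # search only the mask
        self.mask_logits = nn.Parameter(torch.zeros_like(layer.weight))
        self.temperature = temperature

    def forward(self, x):
        if self.training:
            # Relaxed binary-Concrete sample (same trick as the earlier sketch).
            u = torch.rand_like(self.mask_logits).clamp_(1e-6, 1 - 1e-6)
            noise = torch.log(u) - torch.log(1 - u)
            mask = torch.sigmoid((self.mask_logits + noise) / self.temperature)
        else:
            # Hard mask: keep a weight if its learned keep-probability > 0.5.
            mask = (torch.sigmoid(self.mask_logits) > 0.5).float()
        return F.linear(x, self.layer.weight * mask, self.layer.bias)
```

Wrapping each nn.Linear (or the analogous convolution) this way leaves the rest of an existing training script untouched; only the mask logits are optimized during the search phase.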

Limitations & Future Work

  • Search Data Subset – CTS currently uses a small training subset for mask discovery; while effective for vision benchmarks, the impact on highly heterogeneous datasets (e.g., large‑scale ImageNet) warrants deeper study.
  • Hyper‑parameter Sensitivity – Although GRADBALANCE reduces tuning, the KL weighting (\lambda) still requires modest calibration for each architecture.
  • Extension to Structured Pruning – The current formulation yields unstructured sparsity, which may be less friendly to hardware accelerators that favor block or channel pruning. Future work could adapt the Concrete relaxation to structured mask variables.
  • Theoretical Guarantees – The paper provides empirical evidence but lacks a formal analysis of why the reverse KL objective preserves training dynamics; establishing such guarantees could further solidify the approach.

Overall, Concrete Ticket Search offers a pragmatic, compute‑efficient route to uncovering winning tickets, opening the door for broader adoption of lottery‑ticket‑style sparsity in real‑world AI systems.

Authors

  • Tanay Arora
  • Christof Teuscher

Paper Information

  • arXiv ID: 2512.07142v1
  • Categories: cs.LG, cs.AI, cs.CV, cs.NE
  • Published: December 8, 2025
  • PDF: Download PDF