[Paper] ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling

Published: February 9, 2026

Source: arXiv - 2602.09009v1

Overview

The paper “ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling” challenges the long‑standing assumption that the static residual wiring used in deep networks is optimal. By treating residual connections as learnable parameters, the authors show that a modest, data‑driven re‑configuration can dramatically speed up training and squeeze more performance out of very deep models—without any noticeable increase in compute or memory.

Key Contributions

  • Theoretical insight: Proves that the pattern of residual connections directly influences convergence speed, and that sub‑optimal layouts can cause an exponential slowdown.
  • ANCRe framework: Introduces a lightweight, differentiable parameterization of residual links that can be learned jointly with model weights.
  • Negligible overhead: The extra parameters and operations add < 1 % to FLOPs and memory, making ANCRe practical for large‑scale training.
  • Broad empirical validation: Demonstrates faster convergence and higher final accuracy on three fronts:
    • Pre‑training of large language models (LLMs)
    • Diffusion models for image synthesis
    • Deep ResNet classifiers (up to 200 layers)
  • Depth efficiency: Shows that the same performance can be achieved with fewer effective layers when ANCRe is applied, opening the door to slimmer, faster models.
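To see why the claimed overhead is negligible, a quick back-of-envelope check helps: one scalar gate per residual block adds only a handful of parameters to a network with tens of millions of weights. The layer and parameter counts below are illustrative, not taken from the paper.

```python
# Rough overhead estimate for ANCRe-style scalar gates on a
# ResNet-200-scale model. Numbers are illustrative assumptions.
n_weights = 64_700_000   # assumed total weight count of the backbone
n_blocks = 66            # assumed residual blocks in a 200-layer bottleneck ResNet
extra_params = n_blocks  # one scalar gate per residual block

overhead = extra_params / n_weights
print(f"parameter overhead: {overhead:.8f}")  # a tiny fraction of 1%
```

Even if each gate were a small vector rather than a scalar, the added parameter count stays orders of magnitude below the 1% budget the authors report.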

Methodology

  1. Parameterizing connections – Each residual edge is assigned a scalar gating variable g_ij (or a low‑dimensional vector) that multiplies the skip‑connection output.
  2. Learning the gates – The gates are optimized together with the network weights using standard back‑propagation. A simple L2 regularizer encourages sparsity, letting the model “turn off” unnecessary shortcuts.
  3. Adaptive reassignment – During training, the optimizer continuously updates the gates, effectively rewiring the network on‑the‑fly. Because the gates are differentiable, no extra graph‑search or reinforcement‑learning loop is needed.
  4. Implementation tricks
    • Gates are stored as a tiny tensor (one per residual block) and broadcasted, so the memory impact is minimal.
    • A custom CUDA kernel applies the gating with virtually no extra latency.
    • The method plugs into existing frameworks (PyTorch, JAX) by wrapping the framework’s standard module classes (e.g., nn.Module in PyTorch).

The overall training pipeline remains unchanged: data loading, optimizer steps, and learning‑rate schedules are all as usual; ANCRe simply adds a few extra parameters that the optimizer treats like any other weight.
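Steps 1–2 above can be sketched in a few lines of plain Python. This is a minimal toy illustration, not the paper’s implementation: the names (GatedResidualBlock, sparsity_penalty) are made up, the “block” is a single scalar ReLU, and the gate multiplies the skip path exactly as described in step 1.

```python
def block_fn(x, w):
    """Toy residual branch: ReLU(w * x) standing in for a real sub-network."""
    return max(0.0, w * x)

class GatedResidualBlock:
    """Residual block with a learnable gate g on the skip connection.

    Standard residual:     y = x + f(x)
    ANCRe-style (sketch):  y = g * x + f(x), with g trained by backprop
    alongside the ordinary weights.
    """
    def __init__(self, w, g=1.0):
        self.w = w  # ordinary block weight
        self.g = g  # gating variable on the skip path

    def forward(self, x):
        return self.g * x + block_fn(x, self.w)

def sparsity_penalty(gates, lam=1e-3):
    # The L2 regularizer from step 2: shrinks unneeded gates toward zero,
    # letting the optimizer effectively switch shortcuts off.
    return lam * sum(g * g for g in gates)

blk = GatedResidualBlock(w=0.5, g=1.0)
y = blk.forward(2.0)                 # 1.0 * 2.0 + ReLU(0.5 * 2.0) = 3.0
reg = sparsity_penalty([blk.g])      # 1e-3 * 1.0**2 = 0.001
```

In a real framework the gates would be registered as trainable tensors so autograd handles their updates; here the point is only that the forward pass changes by a single multiply per block.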

Results & Findings

| Benchmark | Baseline (static residual) | ANCRe (adaptive) | Speed‑up (epochs) | Final metric improvement |
| --- | --- | --- | --- | --- |
| LLM (GPT‑like, 1.3B) | 20.4B tokens to reach 0.5B perplexity | 15.8B tokens | ~22% fewer epochs | +0.4% lower perplexity |
| Diffusion (DDPM, 256×256) | 500k steps to FID = 12.3 | 380k steps | ~24% faster | FID = 11.5 (−0.8) |
| ResNet‑200 (ImageNet) | 76.3% top‑1 accuracy after 90 epochs | 77.1% after 70 epochs | ~22% fewer epochs | +0.8% top‑1 |

Key takeaways

  • Convergence acceleration is consistent across domains, confirming the theoretical claim that better residual layouts shrink the optimization landscape.
  • Performance gains are modest but statistically significant, especially when training budgets are tight.
  • Depth utilization improves: visualizations of the learned gates reveal that early layers tend to keep many shortcuts, while deeper layers prune redundant connections, effectively “compressing” depth.

Practical Implications

  • Faster model iteration: Teams can cut pre‑training time by ~20 % without buying extra hardware—valuable for rapid prototyping of LLMs or diffusion models.
  • Cost savings: Reduced epochs translate directly into lower cloud‑compute bills, especially for multi‑billion‑parameter training runs.
  • Model slimming: By identifying and disabling unnecessary residual paths, developers can produce leaner inference graphs that run faster on edge devices or mobile GPUs.
  • Plug‑and‑play adoption: Since ANCRe is a thin wrapper around existing residual blocks, integrating it into current codebases (PyTorch, TensorFlow, JAX) requires only a few lines of change.
  • Potential for automated architecture search: ANCRe’s differentiable gating can be combined with NAS pipelines to explore even richer connectivity patterns without a separate search phase.
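The model‑slimming idea above amounts to a simple post‑training step: zero out gates whose magnitude falls below a threshold and drop the corresponding skip paths from the inference graph. A minimal sketch, with made‑up gate values and an illustrative threshold (the paper does not prescribe a specific pruning rule):

```python
def prune_gates(gates, threshold=0.05):
    """Zero out gates with |g| below `threshold`.

    Returns the pruned gate list and the indices of residual paths
    that can be removed from the inference graph entirely.
    """
    pruned, removed = [], []
    for i, g in enumerate(gates):
        if abs(g) < threshold:
            pruned.append(0.0)
            removed.append(i)
        else:
            pruned.append(g)
    return pruned, removed

# Hypothetical learned gates: early layers keep their shortcuts,
# deeper layers have shrunk toward zero (matching the paper's
# observation about depth utilization).
learned = [0.98, 0.91, 0.40, 0.03, 0.01]
pruned, removed = prune_gates(learned)
# removed == [3, 4]: two residual paths can be dropped at inference
```

Because a zeroed gate makes its skip path a no‑op, the pruned connections can be deleted from the exported graph with no change in the model’s outputs.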

Limitations & Future Work

  • Scope of experiments: The paper focuses on vision and language backbones; applicability to other domains (e.g., speech, reinforcement learning) remains to be verified.
  • Gate regularization sensitivity: Choosing the right sparsity strength is still a hyper‑parameter; overly aggressive regularization can prune useful shortcuts and hurt performance.
  • Theoretical gap: While convergence bounds are proved for simplified linear models, extending the analysis to full non‑linear deep nets is an open challenge.
  • Future directions suggested by the authors include:
    • Exploring multi‑dimensional gating (e.g., channel‑wise or spatial‑wise) for finer‑grained adaptation.
    • Combining ANCRe with dynamic depth techniques (early‑exit, layer dropping) for even greater efficiency.
    • Investigating the interaction between adaptive connections and modern optimizers (AdamW, LAMB) at extreme scales.

Bottom line: ANCRe offers a surprisingly simple yet powerful lever—learnable residual connections—that can shave weeks off large‑scale training and make deep networks more resource‑efficient. For developers building the next generation of foundation models, it’s a low‑cost upgrade worth trying.

Authors

  • Yilang Zhang
  • Bingcong Li
  • Niao He
  • Georgios B. Giannakis

Paper Information

  • arXiv ID: 2602.09009v1
  • Categories: cs.LG, cs.AI
  • Published: February 9, 2026