[Paper] ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling

Published: February 9, 2026

Source: arXiv - 2602.09009v1

Overview

The paper “ANCRe: Adaptive Neural Connection Reassignment for Efficient Depth Scaling” challenges the long‑standing assumption that the static residual wiring used in deep networks is optimal. By treating residual connections as learnable parameters, the authors show that a modest, data‑driven re‑configuration can dramatically speed up training and squeeze more performance out of very deep models—without any noticeable increase in compute or memory.

Key Contributions

  • Theoretical insight: Proves that the pattern of residual connections directly influences convergence speed, and that sub‑optimal layouts can cause an exponential slowdown.
  • ANCRe framework: Introduces a lightweight, differentiable parameterization of residual links that can be learned jointly with model weights.
  • Negligible overhead: The extra parameters and operations add < 1 % to FLOPs and memory, making ANCRe practical for large‑scale training.
  • Broad empirical validation: Demonstrates faster convergence and higher final accuracy on three fronts:
    • Pre‑training of large language models (LLMs)
    • Diffusion models for image synthesis
    • Deep ResNet classifiers (up to 200 layers)
  • Depth efficiency: Shows that the same performance can be achieved with fewer effective layers when ANCRe is applied, opening the door to slimmer, faster models.
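To see why the claimed overhead is negligible, a quick back-of-envelope check helps: one scalar gate per residual block adds only a handful of parameters to a network with tens of millions of weights. The layer and parameter counts below are illustrative, not taken from the paper.

```python
# Rough overhead estimate for ANCRe-style scalar gates on a
# ResNet-200-scale model. Numbers are illustrative assumptions.
n_weights = 64_700_000   # assumed total weight count of the backbone
n_blocks = 66            # assumed residual blocks in a 200-layer bottleneck ResNet
extra_params = n_blocks  # one scalar gate per residual block

overhead = extra_params / n_weights
print(f"parameter overhead: {overhead:.8f}")  # a tiny fraction of 1%
```

Even if each gate were a small vector rather than a scalar, the added parameter count stays orders of magnitude below the 1% budget the authors report.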

Methodology

  1. Parameterizing connections – Each residual edge is assigned a scalar gating variable g_ij (or a low‑dimensional vector) that multiplies the skip‑connection output.
  2. Learning the gates – The gates are optimized together with the network weights using standard back‑propagation. A simple L2 regularizer encourages sparsity, letting the model “turn off” unnecessary shortcuts.
  3. Adaptive reassignment – During training, the optimizer continuously updates the gates, effectively rewiring the network on‑the‑fly. Because the gates are differentiable, no extra graph‑search or reinforcement‑learning loop is needed.
  4. Implementation tricks
    • Gates are stored as a tiny tensor (one per residual block) and broadcasted, so the memory impact is minimal.
    • A custom CUDA kernel applies the gating with virtually no extra latency.
    • The method plugs into existing frameworks (PyTorch, JAX) by wrapping the framework’s standard module classes (e.g., nn.Module in PyTorch).

The overall training pipeline remains unchanged: data loading, optimizer steps, and learning‑rate schedules are all as usual; ANCRe simply adds a few extra parameters that the optimizer treats like any other weight.
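Steps 1–2 above can be sketched in a few lines of plain Python. This is a minimal toy illustration, not the paper’s implementation: the names (GatedResidualBlock, sparsity_penalty) are made up, the “block” is a single scalar ReLU, and the gate multiplies the skip path exactly as described in step 1.

```python
def block_fn(x, w):
    """Toy residual branch: ReLU(w * x) standing in for a real sub-network."""
    return max(0.0, w * x)

class GatedResidualBlock:
    """Residual block with a learnable gate g on the skip connection.

    Standard residual:     y = x + f(x)
    ANCRe-style (sketch):  y = g * x + f(x), with g trained by backprop
    alongside the ordinary weights.
    """
    def __init__(self, w, g=1.0):
        self.w = w  # ordinary block weight
        self.g = g  # gating variable on the skip path

    def forward(self, x):
        return self.g * x + block_fn(x, self.w)

def sparsity_penalty(gates, lam=1e-3):
    # The L2 regularizer from step 2: shrinks unneeded gates toward zero,
    # letting the optimizer effectively switch shortcuts off.
    return lam * sum(g * g for g in gates)

blk = GatedResidualBlock(w=0.5, g=1.0)
y = blk.forward(2.0)                 # 1.0 * 2.0 + ReLU(0.5 * 2.0) = 3.0
reg = sparsity_penalty([blk.g])      # 1e-3 * 1.0**2 = 0.001
```

In a real framework the gates would be registered as trainable tensors so autograd handles their updates; here the point is only that the forward pass changes by a single multiply per block.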

Results & Findings

| Benchmark | Baseline (static residual) | ANCRe (adaptive) | Speed‑up (epochs) | Final metric improvement |
| --- | --- | --- | --- | --- |
| LLM (GPT‑like, 1.3B) | 20.4B tokens to reach 0.5B perplexity | 15.8B tokens | ~22% fewer epochs | +0.4% lower perplexity |
| Diffusion (DDPM, 256×256) | 500k steps to FID = 12.3 | 380k steps | ~24% faster | FID = 11.5 (−0.8) |
| ResNet‑200 (ImageNet) | 76.3% top‑1 accuracy after 90 epochs | 77.1% after 70 epochs | ~22% fewer epochs | +0.8% top‑1 |

Key takeaways

  • Convergence acceleration is consistent across domains, confirming the theoretical claim that better residual layouts shrink the optimization landscape.
  • Performance gains are modest but statistically significant, especially when training budgets are tight.
  • Depth utilization improves: visualizations of the learned gates reveal that early layers tend to keep many shortcuts, while deeper layers prune redundant connections, effectively “compressing” depth.

Practical Implications

  • Faster model iteration: Teams can cut pre‑training time by ~20 % without buying extra hardware—valuable for rapid prototyping of LLMs or diffusion models.
  • Cost savings: Reduced epochs translate directly into lower cloud‑compute bills, especially for multi‑billion‑parameter training runs.
  • Model slimming: By identifying and disabling unnecessary residual paths, developers can produce leaner inference graphs that run faster on edge devices or mobile GPUs.
  • Plug‑and‑play adoption: Since ANCRe is a thin wrapper around existing residual blocks, integrating it into current codebases (PyTorch, TensorFlow, JAX) requires only a few lines of change.
  • Potential for automated architecture search: ANCRe’s differentiable gating can be combined with NAS pipelines to explore even richer connectivity patterns without a separate search phase.
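The model‑slimming idea above amounts to a simple post‑training step: zero out gates whose magnitude falls below a threshold and drop the corresponding skip paths from the inference graph. A minimal sketch, with made‑up gate values and an illustrative threshold (the paper does not prescribe a specific pruning rule):

```python
def prune_gates(gates, threshold=0.05):
    """Zero out gates with |g| below `threshold`.

    Returns the pruned gate list and the indices of residual paths
    that can be removed from the inference graph entirely.
    """
    pruned, removed = [], []
    for i, g in enumerate(gates):
        if abs(g) < threshold:
            pruned.append(0.0)
            removed.append(i)
        else:
            pruned.append(g)
    return pruned, removed

# Hypothetical learned gates: early layers keep their shortcuts,
# deeper layers have shrunk toward zero (matching the paper's
# observation about depth utilization).
learned = [0.98, 0.91, 0.40, 0.03, 0.01]
pruned, removed = prune_gates(learned)
# removed == [3, 4]: two residual paths can be dropped at inference
```

Because a zeroed gate makes its skip path a no‑op, the pruned connections can be deleted from the exported graph with no change in the model’s outputs.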

Limitations & Future Work

  • Scope of experiments: The paper focuses on vision and language backbones; applicability to other domains (e.g., speech, reinforcement learning) remains to be verified.
  • Gate regularization sensitivity: Choosing the right sparsity strength is still a hyper‑parameter; overly aggressive regularization can prune useful shortcuts and hurt performance.
  • Theoretical gap: While convergence bounds are proved for simplified linear models, extending the analysis to full non‑linear deep nets is an open challenge.
  • Future directions suggested by the authors include:
    • Exploring multi‑dimensional gating (e.g., channel‑wise or spatial‑wise) for finer‑grained adaptation.
    • Combining ANCRe with dynamic depth techniques (early‑exit, layer dropping) for even greater efficiency.
    • Investigating the interaction between adaptive connections and modern optimizers (AdamW, LAMB) at extreme scales.

Bottom line: ANCRe offers a surprisingly simple yet powerful lever—learnable residual connections—that can shave weeks off large‑scale training and make deep networks more resource‑efficient. For developers building the next generation of foundation models, it’s a low‑cost upgrade worth trying.

Authors

  • Yilang Zhang
  • Bingcong Li
  • Niao He
  • Georgios B. Giannakis

Paper Information

  • arXiv ID: 2602.09009v1
  • Categories: cs.LG, cs.AI
  • Published: February 9, 2026