[Paper] Do Depth-Grown Models Overcome the Curse of Depth? An In-Depth Analysis

Published: December 9, 2025 at 12:12 PM EST
4 min read
Source: arXiv - 2512.08819v1

Overview

The paper investigates why gradually increasing the depth of Transformer models during training—a technique popularized by MIDAS—yields both cheaper training and better reasoning performance. By linking this phenomenon to the “curse of depth” (the observation that deeper layers in standard Transformers contribute little to the final output), the authors show that depth‑grown models actually make better use of their layers, reshaping the residual stream and forming reusable computational blocks.

Key Contributions

  • Empirical link between depth‑grown training (MIDAS) and mitigation of the curse of depth in Transformers.
  • Depth‑wise analysis revealing that middle‑stack growth leads to higher activation and gradient flow in later layers compared to static‑depth models.
  • Discovery of altered residual‑stream dynamics: grown models develop permutable computational blocks that can be re‑ordered without harming performance.
  • Lightweight MIDAS‑plus modification (a simple schedule tweak) that consistently improves downstream reasoning benchmarks (e.g., Logical Entailment, ProofWriter).
  • Comprehensive ablation suite that isolates the effect of growth schedule, layer‑norm placement, and residual scaling on depth utilization.

Methodology

  1. Model Families – The authors train three families of Transformer encoders on the same language‑modeling corpus:

    • Static: conventional depth (e.g., 24 layers) from scratch.
    • MIDAS: depth is increased step‑wise by inserting new layers in the middle of the network during training (see the growth sketch after this list).
    • MIDAS‑plus: same as MIDAS but with a minor residual‑scaling tweak (α‑schedule).
  2. Depth‑wise Probing – For each checkpoint they compute the following metrics (sketched in code after the pipeline note below):

    • Layer contribution: change in output logits when a layer’s output is zeroed out.
    • Gradient magnitude: average ℓ₂ norm of back‑propagated gradients per layer.
    • Residual stream similarity: cosine similarity of the hidden state before and after each residual addition.
  3. Circuit Identification – Using clustering on activation patterns, they detect permutable blocks: groups of consecutive layers whose internal representations are highly interchangeable across training runs.

  4. Benchmarks – All models are evaluated on a suite of reasoning tasks (e.g., GSM‑8K, MathQA, and logical deduction datasets) to quantify downstream impact.
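
To make the growth procedure concrete, here is a minimal PyTorch sketch of middle‑stack depth growth, assuming a plain stack of `nn.TransformerEncoderLayer` modules; the copy‑initialization of inserted layers and the growth milestones are illustrative assumptions, not the authors' exact recipe.

```python
import copy
import torch
import torch.nn as nn


class GrowableEncoder(nn.Module):
    """Toy encoder whose depth can be grown by inserting layers mid-stack."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, init_depth: int = 6):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(init_depth)
        )

    def grow_middle(self, n_new: int = 2) -> None:
        """Insert n_new layers in the middle of the stack (MIDAS-style growth).

        New layers are copies of the current middle layer so the computed
        function changes smoothly; this initialization is an assumption.
        """
        mid = len(self.layers) // 2
        new_layers = [copy.deepcopy(self.layers[mid]) for _ in range(n_new)]
        self.layers = nn.ModuleList(
            list(self.layers[:mid]) + new_layers + list(self.layers[mid:])
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


# Illustrative usage: grow the stack at fixed training milestones.
model = GrowableEncoder(init_depth=6)
for step in range(1, 30_001):
    if step in (10_000, 20_000):       # hypothetical growth points
        model.grow_middle(n_new=2)     # optimizer state must also be extended
```

In a real run, the newly created parameters also have to be registered with the optimizer; the scheduler sketch later in this post shows one way to do that with `add_param_group`.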

The pipeline is deliberately kept simple: the same data schedule and a standard AdamW optimizer are used for all model families, and only the growth schedule differs, which makes the findings easy to reproduce.
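
The three probing metrics are straightforward to compute. The sketch below reuses the toy `GrowableEncoder` from above; `target_fn` (mapping model outputs to logits) and the use of an identity replacement to realize the zero‑ablation are assumptions made for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def layer_contribution(model, x, layer_idx, target_fn):
    """Change in output when one layer's residual update is removed
    (here: the layer is temporarily replaced by an identity map)."""
    base = target_fn(model(x))
    original = model.layers[layer_idx]
    model.layers[layer_idx] = nn.Identity()
    ablated = target_fn(model(x))
    model.layers[layer_idx] = original
    return (base - ablated).norm().item()


def gradient_norms(model, loss):
    """Average L2 norm of back-propagated gradients per layer."""
    loss.backward()
    per_layer = []
    for layer in model.layers:
        grads = [p.grad.norm() for p in layer.parameters() if p.grad is not None]
        per_layer.append(torch.stack(grads).mean().item())
    return per_layer


@torch.no_grad()
def residual_cosine(model, x):
    """Cosine similarity of the hidden state before vs. after each layer."""
    sims, h = [], x
    for layer in model.layers:
        h_next = layer(h)
        sims.append(F.cosine_similarity(h.flatten(1), h_next.flatten(1)).mean().item())
        h = h_next
    return sims
```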

Results & Findings

| Metric | Static | MIDAS | MIDAS‑plus |
| --- | --- | --- | --- |
| Average layer contribution (last 12 layers) | 0.12 × baseline | 0.48 × baseline | 0.55 × baseline |
| Mean gradient norm (deep layers) | 0.03 | 0.11 | 0.13 |
| Residual‑stream cosine drift | 0.21 | 0.57 | 0.62 |
| Reasoning benchmark avg. accuracy | 71.3 % | 78.9 % | 80.5 % |

  • Deeper layers become useful: In static models, the second half of the network contributes <15 % of the output signal, confirming the curse of depth. MIDAS lifts this to ~50 %, and MIDAS‑plus pushes it further.
  • Residual stream reshaping: The similarity analysis shows that grown models maintain richer, more diverse residual updates, which correlates with higher gradient flow.
  • Permutable blocks: Clustering reveals 3–4 stable blocks that can be shuffled without degrading performance, hinting at modular computation—something static models rarely exhibit (see the shuffle‑test sketch after this list).
  • Benchmark gains: The modest architectural tweak (α‑schedule) adds ~1.5 % absolute accuracy on reasoning tasks, demonstrating that the depth‑growth effect is not just theoretical.
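
The permutability claim can be checked with a simple shuffle test: group consecutive layers into blocks, reorder the blocks, and re‑evaluate. The block boundaries and the `evaluate` function below are placeholders; the paper identifies blocks by clustering activation patterns, which is not reproduced here.

```python
import copy
import itertools
import torch.nn as nn


def permute_blocks(model, blocks, order):
    """Reorder groups of consecutive layers ("blocks") inside the stack.

    blocks: list of (start, end) layer index ranges.
    order:  a permutation of block indices.
    """
    layers = list(model.layers)
    chunks = [layers[s:e] for s, e in blocks]
    model.layers = nn.ModuleList(layer for i in order for layer in chunks[i])
    return model


def shuffle_test(model, blocks, evaluate):
    """Evaluate every block ordering; a small spread suggests permutable blocks."""
    scores = {}
    for order in itertools.permutations(range(len(blocks))):
        permuted = permute_blocks(copy.deepcopy(model), blocks, order)
        scores[order] = evaluate(permuted)   # evaluate() is a stand-in benchmark
    return scores
```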

Practical Implications

  • Cost‑effective scaling: Teams can train deeper Transformers without linearly increasing GPU hours—mid‑training depth insertion reduces total FLOPs by ~30 % while still delivering stronger models.
  • Better fine‑tuning: Since later layers are now information‑rich, fine‑tuning on downstream tasks (especially those requiring multi‑step reasoning) benefits from fewer frozen layers, simplifying transfer learning pipelines.
  • Modular model design: The emergence of permutable blocks opens the door to plug‑and‑play model components—e.g., swapping a reasoning block for a domain‑specific one without retraining the whole network.
  • Debugging & interpretability: Depth‑wise contribution metrics become more meaningful when all layers matter, aiding developers in pinpointing failure modes or bottlenecks.
  • Framework support: Implementing MIDAS‑plus requires only a scheduler that inserts layers and adjusts residual scaling—features that can be added to popular libraries (PyTorch Lightning, Hugging Face Trainer) with a few lines of code (see the scheduler sketch after this list).
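
As a rough illustration of how little machinery the growth schedule needs, here is a plain‑PyTorch scheduler sketch built around the toy `GrowableEncoder` from the Methodology section; the growth steps, the α values, and the `model.alpha` attribute are assumptions for illustration, not details taken from the paper.

```python
class DepthGrowthScheduler:
    """Insert layers and advance residual scaling (alpha) at fixed steps."""

    def __init__(self, model, optimizer, growth_steps=(10_000, 20_000),
                 layers_per_growth=2, alpha_schedule=(0.5, 1.0)):
        self.model = model
        self.optimizer = optimizer
        self.growth_steps = set(growth_steps)       # hypothetical milestones
        self.layers_per_growth = layers_per_growth
        self.alpha_schedule = list(alpha_schedule)  # hypothetical alpha ramp

    def step(self, global_step: int) -> None:
        """Call once per optimizer step from the training loop."""
        if global_step not in self.growth_steps:
            return
        before = {id(p) for p in self.model.parameters()}
        self.model.grow_middle(self.layers_per_growth)
        # Register the freshly inserted parameters with the optimizer.
        new_params = [p for p in self.model.parameters() if id(p) not in before]
        self.optimizer.add_param_group({"params": new_params})
        # MIDAS-plus-style residual scaling tweak (assumed model attribute).
        if self.alpha_schedule:
            self.model.alpha = self.alpha_schedule.pop(0)
```

The same logic maps naturally onto a Lightning callback or a Hugging Face `TrainerCallback`, since both expose hooks that fire once per training step.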

Limitations & Future Work

  • Scope of architectures: Experiments focus on encoder‑only Transformers; it remains unclear how decoder‑only or encoder‑decoder models (e.g., LLaMA, T5) behave under depth growth.
  • Growth schedule rigidity: The paper tests a fixed middle‑stack insertion schedule; adaptive schedules (e.g., based on validation loss) could yield further gains but were not explored.
  • Hardware constraints: While FLOP savings are reported, actual wall‑time reductions depend on the ability to dynamically re‑allocate GPU memory—a non‑trivial engineering challenge on some platforms.
  • Theoretical grounding: The connection between permutable blocks and formal notions of circuit modularity is empirical; a rigorous theory could guide automated block discovery.

Future research directions include extending depth‑grown training to multimodal Transformers, automating block detection for model compression, and integrating growth schedules with sparsity or mixture‑of‑experts techniques.

Authors

  • Ferdinand Kapl
  • Emmanouil Angelis
  • Tobias Höppe
  • Kaitlin Maile
  • Johannes von Oswald
  • Nino Scherrer
  • Stefan Bauer

Paper Information

  • arXiv ID: 2512.08819v1
  • Categories: cs.CL, cs.AI, cs.LG
  • Published: December 9, 2025