🧠✂️ Neural Network Lobotomy: Removed 7 Layers from an LLM — It Became 30% Faster

Published: January 9, 2026 at 12:46 PM EST
3 min read
Source: Dev.to

TL;DR

| Removal strategy | Speed ↑ | Perplexity Δ | Quality Δ | Works? |
|---|---|---|---|---|
| Baseline (no removal) | — | 1.82 | — | — |
| Remove middle layer #11 | +10 % (59 → 64 tok/s) | 1.89 (+4 %) | −4 % | ✅ |
| Remove 3 middle layers #10–12 | +12 % (59 → 66 tok/s) | 2.24 (+23 %) | −23 % | ⚠️ |
| Remove first layer #0 | +10 % (59 → 64 tok/s) | 5.74 (+215 %) | −215 % | ❌ |
| Remove 7 “safe” layers (3, 4, 5, 9, 10, 11, 12) | +30 % (59 → 77 tok/s) | ~1.87 (≈ +2.5 %) | −2.5 % | ✅ |

All measurements are averages of 10 runs (5 warm‑up) on an MPS backend.
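For context, here is a minimal sketch of that timing loop; the model, tokenizer, and prompt are placeholders, not the article’s exact harness:

```python
import time

import torch

def tokens_per_second(model, tokenizer, prompt, n_runs=10, n_warmup=5, max_new_tokens=128):
    """Mean generation speed in tok/s: n_warmup unmeasured runs, then n_runs measured runs."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    for _ in range(n_warmup):  # warm-up runs: stabilise caches and kernel compilation
        model.generate(**inputs, max_new_tokens=max_new_tokens)
    speeds = []
    for _ in range(n_runs):
        if torch.backends.mps.is_available():
            torch.mps.synchronize()  # flush pending MPS work so timings are honest
        start = time.perf_counter()
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
        if torch.backends.mps.is_available():
            torch.mps.synchronize()
        elapsed = time.perf_counter() - start
        new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
        speeds.append(new_tokens / elapsed)
    return sum(speeds) / len(speeds)
```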

Motivation

Start‑ups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700 k per day on compute alone. Any optimisation that speeds up a model without a noticeable quality loss translates directly into cost savings.

Layer pruning is a simple, hardware‑agnostic way to achieve this:

  • Modern models have dozens (or hundreds) of layers (GPT‑4 ≈ 120+).
  • Not all layers contribute equally to final performance.
  • Some can be removed while the model “barely notices” (a minimal removal sketch follows below).
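As a concrete illustration, here is a minimal sketch of dropping decoder blocks from a LLaMA-style Hugging Face model. The checkpoint name and the layer set are assumptions based on the experiments below (TinyLlama 1.1B, the 7 “safe” layers), not the article’s verbatim code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
REMOVE = {3, 4, 5, 9, 10, 11, 12}                # the 7 "safe" layers found below

model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Keep the surviving decoder blocks in their original order.
kept = [layer for i, layer in enumerate(model.model.layers) if i not in REMOVE]
model.model.layers = torch.nn.ModuleList(kept)
model.config.num_hidden_layers = len(kept)  # keep the config consistent (22 -> 15)

# Recent transformers versions route the KV cache through each block's layer_idx,
# so re-index the survivors to keep them contiguous.
for new_idx, layer in enumerate(model.model.layers):
    layer.self_attn.layer_idx = new_idx

inputs = tokenizer("The capital of France is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```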

The ShortGPT paper (2024) showed that up to 25 % of layers can be dropped from LLaMA‑2 with only minor quality degradation.

Closing Thoughts

  • Early layers encode positional information and basic token relationships—removing them is disastrous.
  • Layer 2 appears to be a “crystallisation point” for language patterns, making it unexpectedly crucial.
  • A sizable chunk of the middle‑to‑late layers is redundant for this small model, offering a low‑effort path to faster inference.

Future work could explore dynamic pruning (activating/deactivating layers per‑prompt) or knowledge‑distillation to bake the redundant layers’ contributions into a slimmer architecture.

All code and raw measurement logs are available in my public GitLab repository (see the Reproducibility section below).

Pruning Results

| Strategy | Removed layers | Speed‑up | Quality loss |
|---|---|---|---|
| Minimal | {3} | ~5 % | ~0.4 % |
| Moderate | {3, 5, 10, 11} | ~18 % | ~1 % |
| Aggressive | {3, 4, 5, 9, 10, 11, 12} | ~32 % | ~2.5 % |

Note: the “Aggressive” setting is shown for completeness; quality deteriorates quickly beyond the balanced configuration.

Optimal strategy: remove the least important layers, i.e. those whose ablation increases perplexity the least; a selection sketch follows below.

Important: never remove layers 0, 2, or 15; they are critical points.
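A minimal sketch of that selection rule, assuming a hypothetical `eval_fn` that returns perplexity on a fixed held‑out text (any PPL evaluation works):

```python
import copy

PROTECTED = {0, 2, 15}  # critical layers per the note above; never candidates

def rank_layers_by_importance(model, eval_fn):
    """Ablate one decoder block at a time; a small perplexity increase = low importance."""
    base_ppl = eval_fn(model)
    ppl_increase = {}
    for i in range(len(model.model.layers)):
        if i in PROTECTED:
            continue
        pruned = copy.deepcopy(model)  # memory-heavy but simple; restore-in-place also works
        del pruned.model.layers[i]     # ablate exactly one layer
        ppl_increase[i] = eval_fn(pruned) - base_ppl
    # Least important first: these are the safest layers to remove.
    return sorted(ppl_increase, key=ppl_increase.get)
```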
Related research

| Year | Project | Focus |
|---|---|---|
| 2024 | ShortGPT | Removing entire layers |
| 2024 | FinerCut | Removing components within layers |
| 2024 | SliceGPT | Removing rows/columns from weight matrices |
| 2025 | LinearPatch | Recovering 94 % of quality after pruning via a Hadamard transform (arXiv) |
| 2025 | MRP (Maximum Redundancy Pruning) | Adaptive removal of the most redundant layers (arXiv) |
| 2025 | CLP (automatic segment search) | Finding optimal segments to remove (arXiv) |

Combining pruning with quantisation (INT4/INT8) can yield even greater speed‑ups.
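As a hedged illustration (not from the article), one simple way to stack the two is PyTorch dynamic INT8 quantisation of the pruned model’s linear layers. Note this path runs on CPU; GPU INT4/INT8 typically goes through libraries such as bitsandbytes instead. `model` is assumed to be the pruned network from the sketches above:

```python
import torch

# Assumes `model` is the pruned network, in float32 on CPU.
model = model.float().cpu().eval()
quantized = torch.ao.quantization.quantize_dynamic(
    model,
    {torch.nn.Linear},   # quantise the large projection matrices
    dtype=torch.qint8,   # INT8 weights; activations quantised on the fly
)
```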

Business impact

  • Cost saving: For a $10 k/month inference GPU budget, pruning can save $2–3 k without noticeable quality loss (back‑of‑envelope check after this list).
  • Scale: At OpenAI’s scale, this translates to millions of dollars.
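Back‑of‑envelope: a +30 % throughput gain means the same workload needs 1/1.3 ≈ 0.77 of the GPU‑hours, i.e. roughly a 23 % cost reduction, or about $2.3 k of a $10 k monthly bill, consistent with the $2–3 k estimate above.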

Caveats & considerations

  • Model size: Results shown for TinyLlama 1.1B; may differ for 7 B / 70 B models.
  • Metric limitation: Perplexity does not capture all quality aspects.
  • Fine‑tuning: Post‑pruning fine‑tuning can recover some lost quality.
  • Dataset diversity: Experiments were run on a single dataset; broader testing is needed.
  • Measurement variance: Speed on MPS backend varies ±10 %; run many trials for reliable numbers.
  • Chain‑of‑thought degradation: Recent work (arXiv 2510.22228) shows that removing even 1–2 layers can break multi‑step reasoning, while simple tasks remain unaffected.

Reproducibility

All experiment code is available on GitLab:

```bash
git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyllm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick
```

Key insights

  • Layer 2 is unexpectedly the most important (more so than Layer 0).
  • Layers 3‑5 and 9‑12 are largely redundant and can be removed with minimal impact.
  • Layer 15 is a hidden critical layer in the later part of the network.
  • Practical result: Removing 7 layers (22 → 15) yields ~32 % speed‑up with ~2.5 % quality loss.

Next steps

  1. Run the same pipeline on Llama‑3 8B for stronger validation.
  2. Explore pruning + quantisation combinations.
  3. Investigate what critical layers (2 & 15) actually encode.

If you liked this, subscribe, star the GitLab repo, and share with colleagues.

Questions and suggestions? Drop a comment or DM.

Tags: #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning
