🧠✂️ Neural Network Lobotomy: Removed 7 Layers from an LLM — It Became 30% Faster
Source: Dev.to
TL;DR
| Removal strategy | Speed ↑ | Perplexity Δ | Quality Δ | Works? |
|---|---|---|---|---|
| Baseline (no removal) | – | 1.82 | — | ✅ |
| Remove middle layer #11 | +10 % (59 → 64 tok/s) | 1.89 (+4 %) | –4 % | ✅ |
| Remove 3 middle layers #10‑12 | +12 % (59 → 66 tok/s) | 2.24 (+23 %) | –23 % | ✅ |
| Remove first layer #0 | +10 % (59 → 64 tok/s) | 5.74 (+215 %) | –215 % | ❌ |
| Remove 7 “safe” layers (3, 4, 5, 9, 10, 11, 12) | +30 % (59 → 77 tok/s) | ~1.87 (+2.5 %) | –2.5 % | ✅ |
All measurements are averages of 10 runs (5 warm‑up) on an MPS backend.
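For reference, here is a minimal sketch of how such a tokens-per-second measurement can be set up with Hugging Face `transformers` (the checkpoint name, prompt and generation length are illustrative; this is not the repo’s exact benchmark script):

```python
# Minimal throughput sketch: 5 warm-up runs, then the average of 10 timed runs,
# matching the protocol described above. Checkpoint and prompt are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
device = "mps" if torch.backends.mps.is_available() else "cpu"
dtype = torch.float16 if device == "mps" else torch.float32

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=dtype).to(device).eval()

inputs = tok("Explain layer pruning in one paragraph.", return_tensors="pt").to(device)

def tokens_per_second(max_new_tokens: int = 128) -> float:
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    elapsed = time.perf_counter() - start
    return (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

for _ in range(5):                                   # warm-up runs (discarded)
    tokens_per_second()
speeds = [tokens_per_second() for _ in range(10)]    # timed runs
print(f"{sum(speeds) / len(speeds):.1f} tok/s")
```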
Motivation
Start‑ups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700 k per day on compute alone. Any optimisation that speeds up a model without a noticeable quality loss translates directly into cost savings.
Layer pruning is a simple, hardware‑agnostic way to achieve this:
- Modern models have dozens (or even hundreds) of layers (GPT‑4 reportedly has 120+).
- Not all layers contribute equally to final performance.
- Some can be removed while the model “barely notices”.
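To make the idea concrete, here is a hedged sketch of how whole decoder blocks can be dropped from a Llama-style Hugging Face model. The layer indices are the “safe” set from my experiments, and the attribute names assume a recent `transformers` version:

```python
# Sketch: dropping whole decoder blocks from a Llama-style model.
# `model.model.layers` is an nn.ModuleList, so layer removal is list surgery.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
)

layers_to_drop = {3, 4, 5, 9, 10, 11, 12}  # the 7 "safe" layers from the experiments

model.model.layers = nn.ModuleList(
    [block for i, block in enumerate(model.model.layers) if i not in layers_to_drop]
)
model.config.num_hidden_layers = len(model.model.layers)

# Keep KV-cache indexing consistent after the surgery
# (recent transformers versions store the layer index on the attention module).
for new_idx, block in enumerate(model.model.layers):
    block.self_attn.layer_idx = new_idx

print(f"Remaining decoder layers: {model.config.num_hidden_layers}")  # 22 -> 15
```

The pruned model can then be saved with `save_pretrained` and benchmarked exactly like the baseline.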
The ShortGPT paper (2024) showed that up to 25 % of layers can be dropped from LLaMA‑2 with only a modest loss in quality.
Note: The “Aggressive” setting is shown for completeness; quality deteriorates quickly beyond the balanced configuration.
Closing Thoughts
- Early layers encode positional information and basic token relationships—removing them is disastrous.
- Layer 2 appears to be a “crystallisation point” for language patterns, making it unexpectedly crucial.
- A sizable chunk of the middle‑to‑late layers is redundant for this small model, offering a low‑effort path to faster inference.
Future work could explore dynamic pruning (activating/deactivating layers per‑prompt) or knowledge‑distillation to bake the redundant layers’ contributions into a slimmer architecture.
All code and raw measurement logs are available in the public GitLab repository linked in the Reproducibility section below.
Pruning Results
| Strategy | Removed Layers | Speed‑up | Quality loss |
|---|---|---|---|
| Minimal | {3} | ~5 % | ~0.4 % |
| Moderate | {3, 5, 10, 11} | ~18 % | ~1 % |
| Aggressive | {3, 4, 5, 9, 10, 11, 12} | ~32 % | ~2.5 % |
Optimal strategy: remove the least important layers, i.e. those whose removal causes the smallest perplexity (PPL) increase.
**Important:** never remove layers 0, 2 or 15 – the experiments show they are critical points.
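Below is a sketch of the kind of single-layer ablation loop that produces such a ranking (the checkpoint and evaluation text are placeholders; a real run should use a proper held-out corpus):

```python
# Sketch: remove one decoder block at a time, measure perplexity on held-out text,
# and rank layers by how much their removal hurts. Checkpoint and text are placeholders.
import copy
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(MODEL)
base = AutoModelForCausalLM.from_pretrained(MODEL).eval()

eval_text = "The quick brown fox jumps over the lazy dog. " * 50  # toy held-out text

def perplexity(model, text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"], use_cache=False).loss
    return torch.exp(loss).item()

def without_layer(model, idx: int):
    pruned = copy.deepcopy(model)
    pruned.model.layers = nn.ModuleList(
        [b for i, b in enumerate(pruned.model.layers) if i != idx]
    )
    pruned.config.num_hidden_layers = len(pruned.model.layers)
    return pruned

baseline = perplexity(base, eval_text)
deltas = {
    idx: perplexity(without_layer(base, idx), eval_text) - baseline
    for idx in range(len(base.model.layers))
}

# Least damaging layers first: these are the candidates for removal.
for idx, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"layer {idx:2d}: +{delta:.3f} PPL")
```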
Ongoing research & related work
| Year | Project | Focus |
|---|---|---|
| 2024 | ShortGPT | Removing entire layers |
| 2024 | FinerCut | Removing components within layers |
| 2024 | SliceGPT | Removing rows/columns from weight matrices |
| 2025 | LinearPatch | Recovering 94 % quality after pruning via Hadamard transform (arXiv) |
| 2025 | MRP (Maximum Redundancy Pruning) | Adaptive removal of most redundant layers (arXiv) |
| 2025 | CLP (Automatic segment search) | Finding optimal segments to remove (arXiv) |
Combining pruning with quantisation (INT4/INT8) can yield even greater speed‑ups.
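As an illustration, an already-pruned checkpoint can be loaded with 4-bit weight quantisation via `bitsandbytes`. Note this needs a CUDA GPU (bitsandbytes does not run on the MPS backend), and the checkpoint path below is a placeholder:

```python
# Sketch: 4-bit quantised loading of a pruned checkpoint (CUDA + bitsandbytes required).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/pruned-tinyllama",      # placeholder: checkpoint saved after layer removal
    quantization_config=bnb_config,
    device_map="auto",
)
```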
Business impact
- Cost saving: For a $10 k/month inference GPU budget, pruning can save $2–3 k per month without noticeable quality loss.
- Scale: At OpenAI’s scale, this translates to millions of dollars.
Caveats & considerations
- Model size: Results shown for TinyLlama 1.1B; may differ for 7 B / 70 B models.
- Metric limitation: Perplexity does not capture all quality aspects.
- Fine‑tuning: Post‑pruning fine‑tuning can recover some of the lost quality (see the LoRA sketch after this list).
- Dataset diversity: Experiments were run on a single dataset; broader testing is needed.
- Measurement variance: Speed on MPS backend varies ±10 %; run many trials for reliable numbers.
- Chain‑of‑thought degradation: Recent work (arXiv 2510.22228) shows that removing even 1–2 layers can break multi‑step reasoning, while simple tasks remain unaffected.
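As mentioned in the fine‑tuning caveat above, a short recovery fine‑tune can claw back some quality. A minimal sketch using LoRA adapters via the `peft` library (the checkpoint path is a placeholder, and the target module names assume a Llama‑style architecture):

```python
# Sketch: attach LoRA adapters to a pruned model for a short recovery fine-tune.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/pruned-tinyllama")  # placeholder

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train for a few hundred steps on data close to the original distribution,
# e.g. with the standard Trainer / SFTTrainer setup.
```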
Reproducibility
All experiment code is available on GitLab:
```bash
git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyllm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick
```
Key insights
- Layer 2 is unexpectedly the most important (more so than Layer 0).
- Layers 3‑5 and 9‑12 are largely redundant and can be removed with minimal impact.
- Layer 15 is a hidden critical layer in the later part of the network.
- Practical result: Removing 7 layers (22 → 15) yields ~32 % speed‑up with ~2.5 % quality loss.
Next steps
- Run the same pipeline on Llama‑3 8B for stronger validation.
- Explore pruning + quantisation combinations.
- Investigate what critical layers (2 & 15) actually encode.
If you liked this, subscribe, star the GitLab repo, and share with colleagues.
Questions and suggestions? Drop a comment or DM.
Tags: #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning