🧠✂️ Neural Network Lobotomy: Removed 7 Layers from an LLM — It Became 30% Faster
Source: Dev.to
TL;DR
| Removal strategy | Speed ↑ | Perplexity Δ | Quality Δ | Works? |
|---|---|---|---|---|
| Baseline (no removal) | – | 1.82 | — | ✅ |
| Remove middle layer #11 | +10 % (59 → 64 tok/s) | 1.89 (+4 %) | –4 % | ✅ |
| Remove 3 middle layers #10‑12 | +12 % (59 → 66 tok/s) | 2.24 (+23 %) | –23 % | ✅ |
| Remove first layer #0 | +10 % (59 → 64 tok/s) | 5.74 (+215 %) | –215 % | ❌ |
| Remove 7 “safe” layers (3, 4, 5, 9, 10, 11, 12) | +30 % (59 → 77 tok/s) | ~1.87 (+2.5 %) | –2.5 % | ✅ |
All measurements are averages of 10 runs (5 warm‑up) on an MPS backend.
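For reference, here is a minimal sketch of how such a tokens-per-second measurement can be set up with Hugging Face `transformers` (the checkpoint name, prompt and generation length are illustrative; this is not the repo’s exact benchmark script):

```python
# Minimal throughput sketch: 5 warm-up runs, then the average of 10 timed runs,
# matching the protocol described above. Checkpoint and prompt are placeholders.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed checkpoint
device = "mps" if torch.backends.mps.is_available() else "cpu"
dtype = torch.float16 if device == "mps" else torch.float32

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=dtype).to(device).eval()

inputs = tok("Explain layer pruning in one paragraph.", return_tensors="pt").to(device)

def tokens_per_second(max_new_tokens: int = 128) -> float:
    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    elapsed = time.perf_counter() - start
    return (out.shape[1] - inputs["input_ids"].shape[1]) / elapsed

for _ in range(5):                                   # warm-up runs (discarded)
    tokens_per_second()
speeds = [tokens_per_second() for _ in range(10)]    # timed runs
print(f"{sum(speeds) / len(speeds):.1f} tok/s")
```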
Motivation
Start‑ups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700 k per day on compute alone. Any optimisation that speeds up a model without a noticeable quality loss translates directly into cost savings.
Layer pruning is a simple, hardware‑agnostic way to achieve this:
- Modern models have dozens (or even hundreds) of layers (GPT‑4 reportedly has 120+).
- Not all layers contribute equally to final performance.
- Some can be removed while the model “barely notices”.
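To make the idea concrete, here is a hedged sketch of how whole decoder blocks can be dropped from a Llama-style Hugging Face model. The layer indices are the “safe” set from my experiments, and the attribute names assume a recent `transformers` version:

```python
# Sketch: dropping whole decoder blocks from a Llama-style model.
# `model.model.layers` is an nn.ModuleList, so layer removal is list surgery.
import torch
from torch import nn
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0", torch_dtype=torch.float16
)

layers_to_drop = {3, 4, 5, 9, 10, 11, 12}  # the 7 "safe" layers from the experiments

model.model.layers = nn.ModuleList(
    [block for i, block in enumerate(model.model.layers) if i not in layers_to_drop]
)
model.config.num_hidden_layers = len(model.model.layers)

# Keep KV-cache indexing consistent after the surgery
# (recent transformers versions store the layer index on the attention module).
for new_idx, block in enumerate(model.model.layers):
    block.self_attn.layer_idx = new_idx

print(f"Remaining decoder layers: {model.config.num_hidden_layers}")  # 22 -> 15
```

The pruned model can then be saved with `save_pretrained` and benchmarked exactly like the baseline.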
The ShortGPT paper (2024) showed that up to 25 % of layers can be dropped from LLaMA‑2 with only a modest loss in quality.
Note: The “Aggressive” setting is shown for completeness; quality deteriorates quickly beyond the balanced configuration.
Closing Thoughts
- Early layers encode positional information and basic token relationships—removing them is disastrous.
- Layer 2 appears to be a “crystallisation point” for language patterns, making it unexpectedly crucial.
- A sizable chunk of the middle‑to‑late layers is redundant for this small model, offering a low‑effort path to faster inference.
Future work could explore dynamic pruning (activating/deactivating layers per‑prompt) or knowledge‑distillation to bake the redundant layers’ contributions into a slimmer architecture.
All code and raw measurement logs are available in the public GitLab repository linked in the Reproducibility section below.
Pruning Results
| Strategy | Removed Layers | Speed‑up | Quality loss |
|---|---|---|---|
| Minimal | {3} | ~5 % | ~0.4 % |
| Moderate | {3, 5, 10, 11} | ~18 % | ~1 % |
| Aggressive | {3, 4, 5, 9, 10, 11, 12} | ~32 % | ~2.5 % |
Optimal strategy: remove the least important layers, i.e. those whose removal causes the smallest perplexity (PPL) increase.
**Important:** never remove layers 0, 2 or 15 – the experiments show they are critical points.
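Below is a sketch of the kind of single-layer ablation loop that produces such a ranking (the checkpoint and evaluation text are placeholders; a real run should use a proper held-out corpus):

```python
# Sketch: remove one decoder block at a time, measure perplexity on held-out text,
# and rank layers by how much their removal hurts. Checkpoint and text are placeholders.
import copy
import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tok = AutoTokenizer.from_pretrained(MODEL)
base = AutoModelForCausalLM.from_pretrained(MODEL).eval()

eval_text = "The quick brown fox jumps over the lazy dog. " * 50  # toy held-out text

def perplexity(model, text: str) -> float:
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"], use_cache=False).loss
    return torch.exp(loss).item()

def without_layer(model, idx: int):
    pruned = copy.deepcopy(model)
    pruned.model.layers = nn.ModuleList(
        [b for i, b in enumerate(pruned.model.layers) if i != idx]
    )
    pruned.config.num_hidden_layers = len(pruned.model.layers)
    return pruned

baseline = perplexity(base, eval_text)
deltas = {
    idx: perplexity(without_layer(base, idx), eval_text) - baseline
    for idx in range(len(base.model.layers))
}

# Least damaging layers first: these are the candidates for removal.
for idx, delta in sorted(deltas.items(), key=lambda kv: kv[1]):
    print(f"layer {idx:2d}: +{delta:.3f} PPL")
```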
Ongoing research & related work
| Year | Project | Focus |
|---|---|---|
| 2024 | ShortGPT | Removing entire layers |
| 2024 | FinerCut | Removing components within layers |
| 2024 | SliceGPT | Removing rows/columns from weight matrices |
| 2025 | LinearPatch | Recovering 94 % quality after pruning via Hadamard transform (arXiv) |
| 2025 | MRP (Maximum Redundancy Pruning) | Adaptive removal of most redundant layers (arXiv) |
| 2025 | CLP (Automatic segment search) | Finding optimal segments to remove (arXiv) |
Combining pruning with quantisation (INT4/INT8) can yield even greater speed‑ups.
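As an illustration, an already-pruned checkpoint can be loaded with 4-bit weight quantisation via `bitsandbytes`. Note this needs a CUDA GPU (bitsandbytes does not run on the MPS backend), and the checkpoint path below is a placeholder:

```python
# Sketch: 4-bit quantised loading of a pruned checkpoint (CUDA + bitsandbytes required).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "path/to/pruned-tinyllama",      # placeholder: checkpoint saved after layer removal
    quantization_config=bnb_config,
    device_map="auto",
)
```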
Business impact
- Cost saving: For a $10 k/month inference GPU budget, pruning can save $2–3 k per month without noticeable quality loss.
- Scale: At OpenAI’s scale, this translates to millions of dollars.
Caveats & considerations
- Model size: Results shown for TinyLlama 1.1B; may differ for 7 B / 70 B models.
- Metric limitation: Perplexity does not capture all quality aspects.
- Fine‑tuning: Post‑pruning fine‑tuning can recover some of the lost quality (see the LoRA sketch after this list).
- Dataset diversity: Experiments were run on a single dataset; broader testing is needed.
- Measurement variance: Speed on MPS backend varies ±10 %; run many trials for reliable numbers.
- Chain‑of‑thought degradation: Recent work (arXiv 2510.22228) shows that removing even 1–2 layers can break multi‑step reasoning, while simple tasks remain unaffected.
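As mentioned in the fine‑tuning caveat above, a short recovery fine‑tune can claw back some quality. A minimal sketch using LoRA adapters via the `peft` library (the checkpoint path is a placeholder, and the target module names assume a Llama‑style architecture):

```python
# Sketch: attach LoRA adapters to a pruned model for a short recovery fine-tune.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("path/to/pruned-tinyllama")  # placeholder

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# ...then train for a few hundred steps on data close to the original distribution,
# e.g. with the standard Trainer / SFTTrainer setup.
```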
Reproducibility
All experiment code is available on GitLab:
```bash
git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyllm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick
```
Key insights
- Layer 2 is unexpectedly the most important (more so than Layer 0).
- Layers 3‑5 and 9‑12 are largely redundant and can be removed with minimal impact.
- Layer 15 is a hidden critical layer in the later part of the network.
- Practical result: Removing 7 layers (22 → 15) yields ~32 % speed‑up with ~2.5 % quality loss.
Next steps
- Run the same pipeline on Llama‑3 8B for stronger validation.
- Explore pruning + quantisation combinations.
- Investigate what critical layers (2 & 15) actually encode.
If you liked this, subscribe, star the GitLab repo, and share with colleagues.
Questions and suggestions? Drop a comment or DM.
Tags: #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning