🧠✂️ Neural Network Lobotomy: Removing 7 Layers from an LLM → 30% Faster
TL;DR
| Pruning strategy | Speed Δ | Perplexity Δ | Quality Δ | Works? |
|---|---|---|---|---|
| Baseline (no pruning) | — | 1.82 | — | — |
| Remove middle layer #11 | +10% (59 → 64 tok/s) | 1.89 (+4%) | −4% | ✓ |
| Remove 3 middle layers #10–12 | +12% (59 → 66 tok/s) | 2.24 (+23%) | −23% | ✗ |
| Remove first layer #0 | +10% (59 → 64 tok/s) | 5.74 (+215%) | −215% | ✗ |
| Remove 7 "safe" layers (3, 4, 5, 9, 10, 11, 12) | +30% (59 → 77 tok/s) | ~1.87 (≈ +2.5%) | −2.5% | ✓ |
All numbers are averages over 10 runs (5 warm-up) on the MPS backend.
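For context, a minimal sketch of how tok/s can be measured this way (assuming a HuggingFace TinyLlama checkpoint; the prompt and generation length are illustrative, not the post's exact script):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: the post's experiments use TinyLlama 1.1B on Apple's MPS backend.
device = "mps" if torch.backends.mps.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0").to(device)

def tokens_per_second(prompt: str, new_tokens: int = 128, runs: int = 10, warmup: int = 5) -> float:
    """Average generation speed over `runs` trials, discarding `warmup` warm-up trials."""
    inputs = tok(prompt, return_tensors="pt").to(device)
    speeds = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        # min_new_tokens pins the output length so tok/s is not skewed by early EOS
        model.generate(**inputs, max_new_tokens=new_tokens,
                       min_new_tokens=new_tokens, do_sample=False)
        if i >= warmup:
            speeds.append(new_tokens / (time.perf_counter() - start))
    return sum(speeds) / len(speeds)

print(f"{tokens_per_second('The capital of France is'):.1f} tok/s")
```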
Motivation
Startups pour millions of dollars into GPUs for LLM inference. OpenAI reportedly spends around $700k per day on compute alone. Any model optimisation that comes without noticeable quality loss translates directly into cost savings.
Layer pruning is a simple method that is not tied to any particular hardware:
- Modern models have dozens (or hundreds) of layers (GPT-4 ≈ 120+).
- Not all layers contribute equally to final performance.
- Some layers can be removed with the model "barely noticing".
The ShortGPT (2024) study showed that **up to 25%** of LLaMA-2's layers can be removed.
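As a concrete illustration of how cheap this is to try (a sketch, not the author's exact script): in HuggingFace LLaMA-family models the transformer blocks live in `model.model.layers`, so whole layers can be dropped in a few lines:

```python
import torch
from transformers import AutoModelForCausalLM

def prune_layers(model, remove: set[int]):
    """Drop whole decoder layers in place (LLaMA-style architectures)."""
    kept = [layer for i, layer in enumerate(model.model.layers) if i not in remove]
    model.model.layers = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    # Recent transformers versions track per-layer KV-cache indices, so the
    # surviving layers' attention modules may also need re-numbering:
    for i, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = i
    return model

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = prune_layers(model, {3, 4, 5, 9, 10, 11, 12})  # the seven "safe" layers
print(model.config.num_hidden_layers)  # 22 -> 15
```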
Pruning Results
| Strategy | Removed layers | Speed-up | Quality loss |
|---|---|---|---|
| Minimal | {3} | ~5% | ~0.4% |
| Moderate | {3, 5, 10, 11} | ~18% | ~1% |
| Aggressive | {3, 4, 5, 9, 10, 11, 12} | ~32% | ~2.5% |
Optimal strategy: remove the least important layers, i.e. those whose PPL increases least when they are removed individually.
Note: the "Aggressive" setting is shown for completeness; quality deteriorates quickly beyond the balanced configuration.
**Important:** never remove layers 0, 2, or 15; they are critical points.
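A hedged sketch of how such an importance ranking can be produced: ablate one layer at a time and record the relative perplexity increase (reusing `prune_layers` from the sketch above; `sample_text` and the 5% threshold are illustrative placeholders):

```python
import copy
import math
import torch

def perplexity(model, tok, text: str, device: str = "cpu") -> float:
    """exp of the mean token cross-entropy; HF shifts the labels internally."""
    enc = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def layer_importance(model, tok, text: str) -> dict[int, float]:
    """Relative PPL increase when each layer is removed on its own."""
    base = perplexity(model, tok, text)
    scores = {}
    for i in range(len(model.model.layers)):
        pruned = prune_layers(copy.deepcopy(model), {i})
        scores[i] = (perplexity(pruned, tok, text) - base) / base
    return scores

# Candidates = layers whose individual removal barely moves perplexity,
# e.g. under a 5% increase (illustrative threshold). Layers 0, 2 and 15
# score far above any reasonable cut-off on this model.
# candidates = [i for i, s in layer_importance(model, tok, sample_text).items() if s < 0.05]
```

In practice the evaluation text should be a held-out dataset rather than a single sample, per the dataset-diversity caveat below.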
Ongoing research & related work
| Year | Project | Focus |
|---|---|---|
| 2024 | ShortGPT | Removing entire layers |
| 2024 | FinerCut | Removing components within layers |
| 2024 | SliceGPT | Removing rows/columns from weight matrices |
| 2025 | LinearPatch | Recovering 94% quality after pruning via a Hadamard transform (arXiv) |
| 2025 | MRP (Maximum Redundancy Pruning) | Adaptive removal of most redundant layers (arXiv) |
| 2025 | CLP (Automatic segment search) | Finding optimal segments to remove (arXiv) |
Combining pruning with quantisation (INT4/INT8) can yield even greater speedโups.
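For example (a sketch, not from the post): PyTorch's dynamic INT8 quantisation can be applied to the Linear layers that survive pruning. Note that this particular API targets CPU inference, not MPS or CUDA:

```python
import torch

# Quantise the remaining nn.Linear weights of the pruned model to INT8.
quantized = torch.quantization.quantize_dynamic(
    model.cpu(),        # dynamic quantisation runs on CPU
    {torch.nn.Linear},  # module types to quantise
    dtype=torch.qint8,
)
```

Whether the two speed-ups actually compose depends heavily on the backend, so it is worth re-benchmarking after each step.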
Business impact
- Cost saving: For a $10k/month inference GPU budget, pruning can save $2–3k without noticeable quality loss.
- Scale: At OpenAI's scale, this translates to millions of dollars.
Caveats & considerations
- Model size: Results shown are for TinyLlama 1.1B; they may differ for 7B / 70B models.
- Metric limitation: Perplexity does not capture all aspects of quality.
- Fine-tuning: Post-pruning fine-tuning can recover some of the lost quality (see the sketch after this list).
- Dataset diversity: Experiments were run on a single dataset; broader testing is needed.
- Measurement variance: Speed on the MPS backend varies by ±10%; run many trials for reliable numbers.
- Chain-of-thought degradation: Recent work (arXiv 2510.22228) shows that removing even 1–2 layers can break multi-step reasoning, while simple tasks remain unaffected.
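On the fine-tuning point, a minimal sketch of post-pruning "healing" using LoRA adapters via the `peft` library (the hyperparameters and target modules are placeholders, not the post's setup):

```python
from peft import LoraConfig, get_peft_model

# Attach small trainable adapters to the pruned model so the surviving
# layers can learn to compensate for the removed ones.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # placeholder choice of modules
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()
# ...then run a short causal-LM fine-tune (e.g. with transformers.Trainer)
# on general text before re-measuring perplexity.
```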
Reproducibility
All experiment code is available on GitLab:
```bash
git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyllm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick
```
Key insights
- Layer 2 is unexpectedly the most important (more so than layer 0).
- Layers 3–5 and 9–12 are largely redundant and can be removed with minimal impact.
- Layer 15 is a hidden critical layer in the later part of the network.
- Practical result: removing 7 layers (22 → 15) yields a ~32% speed-up with ~2.5% quality loss.

Closing thoughts
- Early layers encode positional information and basic token relations; removing them is catastrophic.
- Layer 2 appears to be a "critical point" for language patterns and is unexpectedly important.
- A large share of the middle and late layers is redundant in this small model, which makes faster inference cheap to obtain.
- Future work could explore dynamic pruning (enabling or disabling layers per prompt) or knowledge distillation to fold the redundant layers' contribution into a slimmer architecture.
- All code and detailed measurement logs are available in the public repository (see Reproducibility above).
Next steps
- Run the same pipeline on Llama-3 8B for stronger validation.
- Explore pruning + quantisation combinations.
- Investigate what critical layers (2 & 15) actually encode.
If you liked this, subscribe, star the GitLab repo, and share with colleagues.
Questions and suggestions? Drop a comment or DM.
Tags: #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning