🧠✂️ Neural Network Lobotomy: Removing 7 Layers from an LLM → 30% Faster
TL;DR
| Pruning strategy | Speed Δ | Perplexity Δ | Quality Δ | Works? |
|---|---|---|---|---|
| Baseline (no pruning) | — | 1.82 | — | — |
| Remove middle layer #11 | +10% (59 → 64 tok/s) | 1.89 (+4%) | −4% | ✓ |
| Remove 3 middle layers #10–12 | +12% (59 → 66 tok/s) | 2.24 (+23%) | −23% | ✗ |
| Remove first layer #0 | +10% (59 → 64 tok/s) | 5.74 (+215%) | −215% | ✗ |
| Remove 7 "safe" layers (3, 4, 5, 9, 10, 11, 12) | +30% (59 → 77 tok/s) | ~1.87 (≈ +2.5%) | −2.5% | ✓ |
All numbers are averages over 10 runs (5 warm-up) on the MPS backend.
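For context, a minimal sketch of how tok/s can be measured this way (assuming a HuggingFace TinyLlama checkpoint; the prompt and generation length are illustrative, not the post's exact script):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative setup: the post's experiments use TinyLlama 1.1B on Apple's MPS backend.
device = "mps" if torch.backends.mps.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0").to(device)

def tokens_per_second(prompt: str, new_tokens: int = 128, runs: int = 10, warmup: int = 5) -> float:
    """Average generation speed over `runs` trials, discarding `warmup` warm-up trials."""
    inputs = tok(prompt, return_tensors="pt").to(device)
    speeds = []
    for i in range(warmup + runs):
        start = time.perf_counter()
        # min_new_tokens pins the output length so tok/s is not skewed by early EOS
        model.generate(**inputs, max_new_tokens=new_tokens,
                       min_new_tokens=new_tokens, do_sample=False)
        if i >= warmup:
            speeds.append(new_tokens / (time.perf_counter() - start))
    return sum(speeds) / len(speeds)

print(f"{tokens_per_second('The capital of France is'):.1f} tok/s")
```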
Motivation
Startups pour millions of dollars into GPUs for LLM inference. OpenAI reportedly spends around $700k per day on compute alone. Any model optimisation that comes without noticeable quality loss translates directly into cost savings.
Layer pruning is a simple method that is not tied to any particular hardware:
- Modern models have dozens (or hundreds) of layers (GPT-4 ≈ 120+).
- Not all layers contribute equally to final performance.
- Some layers can be removed with the model "barely noticing".
The ShortGPT (2024) study showed that **up to 25%** of LLaMA-2's layers can be removed.
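As a concrete illustration of how cheap this is to try (a sketch, not the author's exact script): in HuggingFace LLaMA-family models the transformer blocks live in `model.model.layers`, so whole layers can be dropped in a few lines:

```python
import torch
from transformers import AutoModelForCausalLM

def prune_layers(model, remove: set[int]):
    """Drop whole decoder layers in place (LLaMA-style architectures)."""
    kept = [layer for i, layer in enumerate(model.model.layers) if i not in remove]
    model.model.layers = torch.nn.ModuleList(kept)
    model.config.num_hidden_layers = len(kept)
    # Recent transformers versions track per-layer KV-cache indices, so the
    # surviving layers' attention modules may also need re-numbering:
    for i, layer in enumerate(model.model.layers):
        layer.self_attn.layer_idx = i
    return model

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
model = prune_layers(model, {3, 4, 5, 9, 10, 11, 12})  # the seven "safe" layers
print(model.config.num_hidden_layers)  # 22 -> 15
```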
Pruning Results
| Strategy | Removed layers | Speed-up | Quality loss |
|---|---|---|---|
| Minimal | {3} | ~5% | ~0.4% |
| Moderate | {3, 5, 10, 11} | ~18% | ~1% |
| Aggressive | {3, 4, 5, 9, 10, 11, 12} | ~32% | ~2.5% |
Optimal strategy: remove the least important layers, i.e. those whose PPL increases least when they are removed individually.
Note: the "Aggressive" setting is shown for completeness; quality deteriorates quickly beyond the balanced configuration.
**Important:** never remove layers 0, 2, or 15; they are critical points.
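A hedged sketch of how such an importance ranking can be produced: ablate one layer at a time and record the relative perplexity increase (reusing `prune_layers` from the sketch above; `sample_text` and the 5% threshold are illustrative placeholders):

```python
import copy
import math
import torch

def perplexity(model, tok, text: str, device: str = "cpu") -> float:
    """exp of the mean token cross-entropy; HF shifts the labels internally."""
    enc = tok(text, return_tensors="pt").to(device)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

def layer_importance(model, tok, text: str) -> dict[int, float]:
    """Relative PPL increase when each layer is removed on its own."""
    base = perplexity(model, tok, text)
    scores = {}
    for i in range(len(model.model.layers)):
        pruned = prune_layers(copy.deepcopy(model), {i})
        scores[i] = (perplexity(pruned, tok, text) - base) / base
    return scores

# Candidates = layers whose individual removal barely moves perplexity,
# e.g. under a 5% increase (illustrative threshold). Layers 0, 2 and 15
# score far above any reasonable cut-off on this model.
# candidates = [i for i, s in layer_importance(model, tok, sample_text).items() if s < 0.05]
```

In practice the evaluation text should be a held-out dataset rather than a single sample, per the dataset-diversity caveat below.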
Ongoing research & related work
| Year | Project | Focus |
|---|---|---|
| 2024 | ShortGPT | Removing entire layers |
| 2024 | FinerCut | Removing components within layers |
| 2024 | SliceGPT | Removing rows/columns from weight matrices |
| 2025 | LinearPatch | Recovering 94% quality after pruning via a Hadamard transform (arXiv) |
| 2025 | MRP (Maximum Redundancy Pruning) | Adaptive removal of most redundant layers (arXiv) |
| 2025 | CLP (Automatic segment search) | Finding optimal segments to remove (arXiv) |
Combining pruning with quantisation (INT4/INT8) can yield even greater speedโups.
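For example (a sketch, not from the post): PyTorch's dynamic INT8 quantisation can be applied to the Linear layers that survive pruning. Note that this particular API targets CPU inference, not MPS or CUDA:

```python
import torch

# Quantise the remaining nn.Linear weights of the pruned model to INT8.
quantized = torch.quantization.quantize_dynamic(
    model.cpu(),        # dynamic quantisation runs on CPU
    {torch.nn.Linear},  # module types to quantise
    dtype=torch.qint8,
)
```

Whether the two speed-ups actually compose depends heavily on the backend, so it is worth re-benchmarking after each step.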
Business impact
- Cost saving: For a $10k/month inference GPU budget, pruning can save $2–3k without noticeable quality loss.
- Scale: At OpenAI's scale, this translates to millions of dollars.
Caveats & considerations
- Model size: Results shown are for TinyLlama 1.1B; they may differ for 7B / 70B models.
- Metric limitation: Perplexity does not capture all aspects of quality.
- Fine-tuning: Post-pruning fine-tuning can recover some of the lost quality (see the sketch after this list).
- Dataset diversity: Experiments were run on a single dataset; broader testing is needed.
- Measurement variance: Speed on the MPS backend varies by ±10%; run many trials for reliable numbers.
- Chain-of-thought degradation: Recent work (arXiv 2510.22228) shows that removing even 1–2 layers can break multi-step reasoning, while simple tasks remain unaffected.
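On the fine-tuning point, a minimal sketch of post-pruning "healing" using LoRA adapters via the `peft` library (the hyperparameters and target modules are placeholders, not the post's setup):

```python
from peft import LoraConfig, get_peft_model

# Attach small trainable adapters to the pruned model so the surviving
# layers can learn to compensate for the removed ones.
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],  # placeholder choice of modules
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)
peft_model.print_trainable_parameters()
# ...then run a short causal-LM fine-tune (e.g. with transformers.Trainer)
# on general text before re-measuring perplexity.
```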
Reproducibility
All experiment code is available on GitLab:
```bash
git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
cd lobotomyllm
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python experiments/run_ablation.py --experiment quick
```
Key insights
- Layer 2 is unexpectedly the most important (more so than layer 0).
- Layers 3–5 and 9–12 are largely redundant and can be removed with minimal impact.
- Layer 15 is a hidden critical layer in the later part of the network.
- Practical result: removing 7 layers (22 → 15) yields a ~32% speed-up with ~2.5% quality loss.

Closing thoughts
- Early layers encode positional information and basic token relations; removing them is catastrophic.
- Layer 2 appears to be a "critical point" for language patterns and is unexpectedly important.
- A large share of the middle and late layers is redundant in this small model, which makes faster inference cheap to obtain.
- Future work could explore dynamic pruning (enabling or disabling layers per prompt) or knowledge distillation to fold the redundant layers' contribution into a slimmer architecture.
- All code and detailed measurement logs are available in the public repository (see Reproducibility above).
Next steps
- Run the same pipeline on Llama-3 8B for stronger validation.
- Explore pruning + quantisation combinations.
- Investigate what critical layers (2 & 15) actually encode.
If you liked this, subscribe, star the GitLab repo, and share with colleagues.
Questions and suggestions? Drop a comment or DM.
Tags: #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning