Stop Fine-Tuning Blindly: When to Fine-Tune—and When Not to Touch Model Weights
Source: Dev.to
Fine‑Tuning Is a Knife, Not a Hammer
Fine‑tuning has a reputation problem.
- Some people treat it like magic: “Just fine‑tune and the model will understand our domain.”
- Others treat it like a sin: “Never touch weights, it’s all prompt engineering now.”
Both are wrong.
Fine‑tuning is a precision tool. Used well, it turns a generic model into a specialist. Used badly, it burns GPU budgets, bakes in bias, and ships a model that performs worse than the base.
This is a field guide: what types of fine‑tuning exist, what they cost, how to run them, and the traps that quietly ruin outcomes.
There are multiple ways to classify fine‑tuning. The cleanest is to look at:
- What changes – which parameters are updated.
- What signal you train on – labelled pairs, rewards, etc.
- What model type you’re adapting – language, vision, multimodal.
1️⃣ Full‑Model Fine‑Tuning
Definition: Update all model weights so the model fully adapts to the new task.
Traits
- Maximum flexibility, maximum cost.
- Requires strong data quality and careful regularisation.
- Risk: catastrophic forgetting (the model “forgets” general abilities).
When it makes sense
- You have a stable task and a solid dataset (usually 10 k–100 k+ high‑quality samples).
- You can afford experiments and regression testing.
- You need deeper behavioural change than PEFT can deliver.
2️⃣ Parameter‑Efficient Fine‑Tuning (PEFT)
Definition: Freeze most weights and train a small, targeted set of parameters.
You get most of the gains with a fraction of the cost.
Common PEFT sub‑types
| Sub‑type | What it does | Typical cost |
|---|---|---|
| Adapters | Insert small modules inside transformer blocks; train only those adapter weights (a few % of total parameters). | Low |
| Prompt / prefix tuning | Train learnable “soft prompt” vectors (continuous embeddings prepended to the input; prefix tuning extends this to every layer). | Very low |
| Hard prompts | Discrete text tokens; engineered rather than trained, so not PEFT in the strict sense. | N/A |
| LoRA | Decomposes weight updates into low‑rank matrices. | Low‑to‑moderate |
| QLoRA | Runs LoRA on a quantised base model (often 4‑bit), slashing VRAM requirements and making “big‑ish” fine‑tuning viable on consumer GPUs. | Very low |
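The memory win behind QLoRA is that the frozen base weights are stored in 4‑bit precision while only the tiny LoRA matrices stay in full precision. A toy symmetric 4‑bit quantiser sketches the intuition (QLoRA itself uses the NF4 scheme, which this is not):

```python
# Toy symmetric 4-bit quantisation: floats -> signed ints in [-8, 7]
# plus one float scale per block. Illustrates why 4-bit storage is cheap.

def quantize_4bit(weights):
    """Map floats to 4-bit signed integers plus a single scale factor."""
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [max(-8, min(7, round(w / scale))) for w in weights], scale

def dequantize_4bit(q, scale):
    """Recover approximate floats; error is bounded by scale / 2."""
    return [qi * scale for qi in q]

w = [0.12, -0.40, 0.33, 0.05, -0.21]
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
max_err = max(abs(a - b) for a, b in zip(w, w_hat))
print(q, round(max_err, 4))
```

Each weight now costs 4 bits instead of 16 or 32, at the price of a small, bounded reconstruction error.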
Why LoRA wins
- You store only the delta ΔW as its low‑rank factors (tiny).
- Easy to swap adapters per task.
- Strong performance per compute.
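The “tiny delta” claim is easy to see in code. A minimal pure‑Python sketch of the LoRA idea with toy sizes: instead of updating a d×d matrix W, learn a low‑rank delta B @ A and apply W_eff = W + (alpha / r) · B A at inference (alpha and r are the usual LoRA hyperparameters):

```python
# Minimal LoRA sketch: W stays frozen; only B (d x r) and A (r x d)
# are trainable, and their product forms the weight update.

def matmul(X, Y):
    rows, inner, cols = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(inner))
             for j in range(cols)] for i in range(rows)]

d, r, alpha = 8, 2, 4                      # hidden size, LoRA rank, scaling
W = [[0.0] * d for _ in range(d)]          # frozen base weights
B = [[0.1] * r for _ in range(d)]          # trainable, d x r
A = [[0.1] * d for _ in range(r)]          # trainable, r x d

delta = matmul(B, A)
W_eff = [[W[i][j] + (alpha / r) * delta[i][j] for j in range(d)]
         for i in range(d)]

full_ft_params = d * d                     # every weight updated
lora_params = d * r + r * d                # only B and A updated
print(full_ft_params, lora_params)
```

At d = 8 the saving is modest, but full FT grows quadratically in d while LoRA grows linearly, which is why the gap becomes enormous at real model sizes. Swapping adapters per task means swapping only B and A.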
3️⃣ Training Signals
| Signal | Typical use‑cases |
|---|---|
| Labelled input‑output pairs | Classification, extraction, instruction following (instruction tuning), style/tone adaptation. |
| Reward‑model + policy optimisation (RLHF) | SFT → reward model → PPO. |
| Direct Preference Optimisation (DPO) | Simpler operationally; aligns model to preferences. |
| Embedding‑level objectives | Retrieval, similarity, embedding quality (less common for everyday text generation). |
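The DPO row above can be made concrete. A hedged sketch of the DPO objective in pure Python: push the policy’s log‑probability margin on the preferred answer above the reference model’s margin (the log‑prob values here are made‑up numbers, not real model outputs):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l)))."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Positive margin (policy prefers the chosen answer more than the
# reference does) gives a loss below -log(0.5); zero margin sits at it.
better = dpo_loss(-1.0, -5.0, -2.0, -4.0)
neutral = dpo_loss(-2.0, -4.0, -2.0, -4.0)
print(round(better, 4), round(neutral, 4))
```

Operationally this is simpler than RLHF because there is no separate reward model or PPO loop: you optimise this loss directly on preference pairs.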
4️⃣ Modalities
| Modality | Typical models | Fine‑tuning notes |
|---|---|---|
| NLP | BERT, GPT, T5 | Instruction tuning & chain‑of‑thought supervision are common. |
| Vision | ResNet, ViT | Progressive unfreezing & strong augmentation matter. |
| Multimodal | CLIP, BLIP, Flamingo | Biggest challenge: aligning representations across modalities. |
5️⃣ When Fine‑Tuning Shines
- Domain‑specific jargon – e.g., finance risk text where the base model misreads terms like short, subprime, haircut.
- Stabilising behaviour – a model that produces “sometimes great” answers is a nightmare in production; fine‑tuning can reduce variance and prompt complexity.
- On‑prem / latency constraints – self‑hosted models + PEFT are often the only workable path when data residency or latency budgets are strict.
6️⃣ Expensive Mistakes to Avoid
| Mistake | Why it hurts |
|---|---|
| Full fine‑tuning a 10 B‑parameter model without PEFT | Needs 80 GB+ VRAM (or a multi‑card setup) plus high memory and throughput costs; QLoRA or multi‑GPU training is required just to fit it. |
Production‑system checklist
- Checkpoints – storage balloons fast; keep a retention policy.
- Inference latency testing – capture p50 / p95 / p99.
- Versioning – track base model, adapters, and config files.
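The latency item in the checklist takes only the standard library. A quick sketch, where `latencies_ms` stands in for samples you would collect from a real load test:

```python
import statistics

# Fake load-test samples: mostly fast, with a slow 200 ms tail.
latencies_ms = [12, 15, 14, 13, 200, 16, 15, 14, 13, 15] * 10

# n=100 cut points -> percentiles; "inclusive" interpolates between samples.
qs = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = qs[49], qs[94], qs[98]
print(p50, p95, p99)
```

The point of tracking p95/p99 alongside p50: the tail catches the slow outliers that a mean (or even the median) hides completely.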
Monitoring metrics
- Train vs. validation loss divergence (over‑fitting).
- Task metric (F1 / AUC / accuracy) over time.
- Gradient norms (explosions or vanishing).
- GPU utilisation & VRAM (to catch bottlenecks).
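The gradient‑norm metric above, sketched in pure Python over a flat list of per‑parameter gradients: spikes signal explosions, a steady slide toward zero signals vanishing gradients, and clipping is the usual guard.

```python
import math

def global_grad_norm(grads):
    """L2 norm over all parameter gradients, flattened."""
    return math.sqrt(sum(g * g for g in grads))

def clip_by_norm(grads, max_norm=1.0):
    """Rescale gradients when the global norm crosses a threshold."""
    norm = global_grad_norm(grads)
    if norm > max_norm:
        return [g * max_norm / norm for g in grads]
    return grads

healthy = global_grad_norm([0.1, -0.2, 0.05])
exploded = global_grad_norm([30.0, -45.0, 12.0])
print(round(healthy, 4), round(exploded, 2))
```

Logging this value every step is cheap; a single spike in the chart often explains a loss divergence that is otherwise mysterious.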
Early stopping is not optional in small‑data regimes: validation may look amazing while test performance collapses.
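Early stopping itself is a few lines. A minimal sketch: stop when validation loss has not improved for `patience` consecutive evaluations.

```python
class EarlyStopper:
    """Stop training once val loss stalls for `patience` evaluations."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience, self.min_delta = patience, min_delta
        self.best = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best, self.bad_evals = val_loss, 0  # improvement: reset
        else:
            self.bad_evals += 1                      # stall: count it
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
for step, loss in enumerate([0.9, 0.7, 0.72, 0.71, 0.73]):
    if stopper.should_stop(loss):
        print(f"stopping at eval {step}")
        break
```

Pair it with checkpointing so you can roll back to the best‑validation step rather than the last one.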
Common issues & fixes
| Symptom | Fix |
|---|---|
| Leakage between train and validation (duplicates, grouped or time‑ordered data) | Use group‑aware or time‑based splits; deduplicate aggressively. |
| Model learns the majority class | Apply class weighting, resampling, or switch to a metric like F1 (more informative than accuracy). |
| Over‑fitting on small data | Match model size to data, prefer PEFT (LoRA/QLoRA), add regularisation. |
| High AUC but fails latency/memory budgets | Benchmark early; export to ONNX/TensorRT if needed; consider distillation for latency. |
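The class‑weighting fix from the table is a one‑liner. Inverse‑frequency weights, sketched on a toy 90/10 label split:

```python
from collections import Counter

def class_weights(labels):
    """weight_c = n / (k * count_c): rare classes get large weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

labels = [0] * 90 + [1] * 10
print(class_weights(labels))
```

These weights plug into most loss functions (e.g. as per‑class multipliers in cross‑entropy), so mistakes on the minority class cost roughly as much in aggregate as mistakes on the majority.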
Quick chooser
| Situation | Recommended approach |
|---|---|
| 10 k+ high‑quality samples | Full FT can make sense (if evaluation & regression testing are solid) |
| VRAM tight | QLoRA |
| Need preference alignment | DPO / RLHF‑style preference training |
| Task changes often | Avoid weight updates; design workflow‑centric solutions |
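For completeness, the chooser table rendered as a tiny decision helper; `choose_approach` and its flags are hypothetical names, and the branch order (task churn first, then preferences, then VRAM, then data size) is one reasonable reading of the table, not the only one:

```python
def choose_approach(n_samples, vram_tight=False,
                    needs_preferences=False, task_churns=False):
    """Map the quick-chooser table above to a recommendation string."""
    if task_churns:
        return "prompting / workflow design (no weight updates)"
    if needs_preferences:
        return "DPO or RLHF-style preference training"
    if vram_tight:
        return "QLoRA"
    if n_samples >= 10_000:
        return "full fine-tuning (with solid eval + regression tests)"
    return "PEFT (LoRA)"

print(choose_approach(50_000))
print(choose_approach(2_000, vram_tight=True))
```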
Final thoughts
Successful fine‑tuning isn’t a single training run – it’s a loop:
data → training → evaluation → deployment constraints → monitoring → back to data
Treat it as an engineering system, not a one‑off experiment. In 2026, PEFT methods like LoRA and QLoRA give the best trade‑off curve: strong performance gains, manageable cost, and deployable artefacts.
Goal: not a model that’s “smart in a notebook,” but a model that’s reliable in production.