Fine-Tuning LLMs on Consumer GPUs: A Practical Guide to QLoRA
Introduction
Fine‑tuning large language models (LLMs) on consumer‑grade GPUs is now feasible. Using QLoRA (Quantized Low‑Rank Adaptation) you can fit a 7B model on a single RTX 3090 (24 GB VRAM) and train it in a few hours without any cloud credits. QLoRA combines two ideas:
- Quantization: compress model weights from 32‑bit to 4‑bit.
- LoRA: train small adapter layers instead of the full model.
Result: a 7B model that needs ~28 GB of VRAM in full 32‑bit precision fits in roughly 6 GB as 4‑bit weights plus small LoRA adapters.
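A quick back‑of‑the‑envelope calculation shows where the savings come from; the gap between the ~4 GB of weights and the ~6 GB figure above is quantization metadata, activations, and CUDA overhead, so treat these as estimates:
params = 7e9                                           # ~7 billion weights
print(f"fp32 weights : ~{params * 4 / 1e9:.0f} GB")    # 4 bytes per weight   -> ~28 GB
print(f"4-bit weights: ~{params * 0.5 / 1e9:.1f} GB")  # 0.5 bytes per weight -> ~3.5 GB
print(f"LoRA adapters: ~{21e6 * 2 / 1e9:.2f} GB")      # ~21M bf16 adapter weights (see the LoRA section) -> ~0.04 GB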
Hardware & System Requirements
| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3090 (24 GB); an RTX 3080 can also work | RTX 3090 (24 GB) |
| System RAM | 16 GB | 32 GB |
| Free storage | 50 GB | 50 GB+ |
| OS | Linux/macOS/Windows (CUDA support) | Linux |
Installation
pip install torch transformers peft bitsandbytes trl datasets accelerate
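Before loading anything large, a quick sanity check that PyTorch can see the GPU and that it supports bfloat16 (used as the compute dtype below) can save some debugging later:
import torch

assert torch.cuda.is_available(), "No CUDA GPU detected"
print(torch.cuda.get_device_name(0))                                              # e.g. NVIDIA GeForce RTX 3090
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")   # decimal GB, ~25 GB for a 24 GiB card
print(f"bf16 supported: {torch.cuda.is_bf16_supported()}")                        # True on Ampere and newer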
Model Loading & Quantization
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                       # quantize the base model weights to 4-bit at load time
    bnb_4bit_compute_dtype=torch.bfloat16,   # run matmuls in bf16
    bnb_4bit_quant_type="nf4",               # NormalFloat4, designed for normally distributed weights
    bnb_4bit_use_double_quant=True,          # also quantize the quantization constants to save a bit more memory
)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",   # place the model on the GPU automatically (requires accelerate)
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
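At this point you can verify the quantization paid off; get_memory_footprint() reports the memory used by the model's parameters and buffers (the exact number varies a little by transformers version):
print(f"Model footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
# Expect roughly 4-5 GB for the 4-bit model, versus ~14-15 GB in bf16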
LoRA Configuration
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
    r=16,                  # rank of the low-rank update matrices
    lora_alpha=32,         # scaling factor
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # MLP projections
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 20,971,520 || all params: 7,261,749,248
# trainable%: 0.29%
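If you want to verify which tensors will actually receive gradients, you can walk the parameter list directly; the names below are what PEFT typically produces for these adapters:
# Count and inspect the trainable parameters (the LoRA A/B matrices)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")

lora_param_names = [n for n, p in model.named_parameters() if p.requires_grad]
print(lora_param_names[:2])   # e.g. ...layers.0.self_attn.q_proj.lora_A.default.weight, ...lora_B.default.weight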
Dataset
The guide uses the Brazilian customer‑service conversation dataset:
from datasets import load_dataset
dataset = load_dataset("RichardSakaguchiMS/brazilian-customer-service-conversations")
Formatting
def format_example(example):
    # Wrap each conversation turn in Mistral's [INST] ... [/INST] instruction format
    return {
        "text": f"[INST] {example['input']} [/INST]\n{example['output']}"
    }
dataset = dataset.map(format_example)
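It is worth eyeballing one formatted example before training, since a malformed template is a common cause of a poor fine‑tune. The split and column names here are assumptions about this dataset; adjust them to whatever load_dataset actually returns:
print(dataset)                        # shows the splits and columns that were loaded
print(dataset["train"][0]["text"])    # should read: [INST] <customer message> [/INST]\n<agent reply>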
Training
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=2e-4,
    bf16=True,                       # Ampere GPUs (RTX 30xx) support bfloat16
    logging_steps=10,
    save_strategy="epoch",
)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    args=training_args,
    tokenizer=tokenizer,
    max_seq_length=2048,
    dataset_text_field="text",
)
# Note: newer trl releases move max_seq_length and dataset_text_field into SFTConfig
# and replace tokenizer with processing_class; the arguments above match older trl versions.
trainer.train()
trainer.save_model("./output")   # write the final LoRA adapter to ./output so it can be reloaded below
Optional VRAM Savings
model.gradient_checkpointing_enable() # ~20 % slower but saves memory
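When gradient checkpointing is combined with frozen 4‑bit weights, you may also need model.enable_input_require_grads() so gradients can flow back into the checkpointed blocks (PEFT's prepare_model_for_kbit_training helper does this for you). To check whether the savings are enough for your card, track peak VRAM with PyTorch's allocator statistics:
import torch

model.enable_input_require_grads()   # often required with gradient checkpointing + frozen quantized weights

torch.cuda.reset_peak_memory_stats()
# ... run trainer.train() or a few manual steps ...
print(f"peak VRAM: {torch.cuda.max_memory_allocated() / 1e9:.1f} GB")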
Faster Attention (if supported)
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2",   # requires the flash-attn package and an Ampere or newer GPU
)
Training Summary
| Metric | Value |
|---|---|
| Dataset size | 10,000 examples |
| Epochs | 3 |
| Batch size (effective) | 16 (4 per device × 4 gradient-accumulation steps) |
| Training time | ~4 hours |
| Peak VRAM | 18 GB |
| Final loss | 0.82 |
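These figures line up with the configuration above; the effective batch size and number of optimizer steps follow directly from it:
examples, per_device_batch, grad_accum, epochs = 10_000, 4, 4, 3

effective_batch = per_device_batch * grad_accum    # 16
steps_per_epoch = examples // effective_batch      # 625
total_steps = steps_per_epoch * epochs             # 1,875 optimizer steps over 3 epochs
print(effective_batch, steps_per_epoch, total_steps)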
Inference
from peft import PeftModel
base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    quantization_config=bnb_config,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "./output")
prompt = "[INST] Cliente: Quero saber do meu pedido [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)   # move the tokens to the GPU
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
)
print(tokenizer.decode(outputs[0]))
Simple prompt engineering works well for short, domain‑specific queries.
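If you eventually want a standalone model you can serve without peft, a common follow‑up (a minimal sketch; the output path is just an example) is to reload the base model in bf16, attach the adapter, and merge it into the weights:
from peft import PeftModel
from transformers import AutoModelForCausalLM
import torch

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.3",
    torch_dtype=torch.bfloat16,   # merge into full-precision weights, not the 4-bit ones
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./output").merge_and_unload()
merged.save_pretrained("./mistral-7b-customer-service")
tokenizer.save_pretrained("./mistral-7b-customer-service")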
Practical Tips
- Data quality matters more than quantity: 1,000 high‑quality examples often outperform 100,000 noisy ones.
- Cleaning: ensure consistent formatting before training.
- Learning‑rate adjustments:
- Loss plateaus → increase LR.
- Loss spikes → decrease LR.
- Oscillating loss → lower batch size or LR.
- When to fine‑tune: