The 24GB AI Lab: A Survival Guide to Full-Stack Local AI on Consumer Hardware
Source: Dev.to
1. The Docker + PyTorch Memory Guardrails
Before you even import a model, set the environment variable and training flag that stop the classic “CUDA Out‑of‑Memory” crashes.
```python
# -------------------------------------------------
# 1️⃣ Memory Fix – make VRAM a dynamic pool
# -------------------------------------------------
import os

# Must be set BEFORE torch is imported / the model is initialised
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```
Multi‑GPU Bug Fix
When two GPUs are present, Unsloth/Hugging Face will, by default, try to average tokens across devices, which throws an obscure `AttributeError: 'int' object has no attribute 'mean'`.
Add the following flag to your TrainingArguments:
```python
# -------------------------------------------------
# 2️⃣ Multi‑GPU Bug Fix – stop token averaging
# -------------------------------------------------
# Pass this to your TrainingArguments (see §2)
average_tokens_across_devices = False
```
2. Model‑Training Settings (the “sweet spot” for 24 GB total VRAM)
| Setting | Value | Why it matters |
|---|---|---|
| `max_seq_length` | 1024 | Keeps the context window within the memory budget of two 12 GB cards. |
| `per_device_train_batch_size` | 1 | Guarantees that each GPU only holds a single sample at a time. |
| `gradient_accumulation_steps` | 8 | Runs 8 single‑sample forward/backward passes before each optimizer update – same math, far less VRAM pressure. |
| `data_collator` | `DataCollatorForLanguageModeling` | Prevents dimension‑mismatch errors when batching text dynamically. |
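The arithmetic behind these numbers is worth spelling out, since it determines your effective batch size. A quick sketch (the two‑GPU count is the dual‑12 GB rig assumed throughout this guide):

```python
# Effective batch size under gradient accumulation:
# each optimizer step sees per_device_batch × accum_steps × num_gpus samples.
per_device_train_batch_size = 1
gradient_accumulation_steps = 8
num_gpus = 2  # dual RTX 3060 rig assumed in this guide

effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_gpus
)
print(effective_batch_size)  # → 16
```

So despite holding only one sample per GPU at a time, each weight update still averages gradients over 16 samples.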
```python
from trl import SFTConfig  # TrainingArguments itself has no max_seq_length field
from transformers import DataCollatorForLanguageModeling

training_args = SFTConfig(
    output_dir="output",
    max_seq_length=1024,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    average_tokens_across_devices=False,  # multi‑GPU bug fix from §1
    # other args …
)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,  # the tokenizer from your model setup
    mlm=False,            # causal LM, not masked LM
)
```
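To see why a collator matters at all, here is a minimal pure‑Python sketch of the dynamic padding that `DataCollatorForLanguageModeling` performs per batch (the token IDs and pad value are made up for illustration; the real collator also builds attention masks and labels):

```python
# Toy dynamic-padding collator: pad every sequence in a batch
# to the length of the longest one, as HF collators do per batch.
def pad_batch(sequences, pad_id=0):
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]

batch = [[101, 7592, 102], [101, 102]]  # two token-ID lists of unequal length
print(pad_batch(batch))  # → [[101, 7592, 102], [101, 102, 0]]
```

Without this per‑batch padding, stacking unequal‑length sequences into a tensor is exactly what produces the dimension‑mismatch errors mentioned in the table above.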
3. Merging LoRA Back into the Base Model (the step most people skip)
A naïve `model.save_pretrained_merged(...)` call tries to load both the base model and the LoRA adapters into VRAM → instant freeze on 12 GB cards.
Force the heavy lifting onto system RAM:
```python
# -------------------------------------------------
# 3️⃣ VRAM Insurance Policy – CPU offloading merge
# -------------------------------------------------
model.save_pretrained_merged(
    "model_output",
    tokenizer,
    save_method="merged_4bit_forced",  # 4-bit merge, optimal for Ollama
    maximum_memory_usage=0.4,          # use 40 % of VRAM, spill the rest to CPU
)
```
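With `maximum_memory_usage=0.4`, the per‑card budget works out as follows (12 GB per card is the rig assumed in this guide):

```python
# How much of each 12 GB card the merge is allowed to touch
vram_per_card_gb = 12
maximum_memory_usage = 0.4

gpu_budget_gb = vram_per_card_gb * maximum_memory_usage
print(gpu_budget_gb)  # ≈ 4.8 GB per card; the remaining work spills to system RAM
```

That ~4.8 GB ceiling leaves plenty of headroom for the base weights streaming through during the merge, which is why the freeze disappears.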
Result: The merge takes a few minutes longer but succeeds 100 % of the time.
4. Cleaning the Exported Safetensors (the “Header Stripper”)
Ollama often rejects the exported `.gguf` / `.safetensors` because PyTorch leaves non‑standard metadata (U8/U9 headers).
Run this tiny script inside the same Docker container to strip the headers:
```python
# -------------------------------------------------
# 4️⃣ Washing Script – sanitize safetensors metadata
# -------------------------------------------------
import os

from safetensors.torch import load_file, save_file


def sanitize_metadata(input_dir: str, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    for filename in os.listdir(input_dir):
        if filename.endswith(".safetensors"):
            src = os.path.join(input_dir, filename)
            tensors = load_file(src)
            # Re-save with an *empty* metadata dict
            dst = os.path.join(output_dir, filename)
            save_file(tensors, dst, metadata={})
            print(f"Sanitized: {filename}")


# Adjust these paths to match your Docker volume mounts
sanitize_metadata(
    "/workspace/work/model_output",
    "/workspace/work/sanitized_model",
)
```
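To confirm the wash worked, you can inspect the raw header yourself: a safetensors file begins with an 8‑byte little‑endian length followed by a JSON header, and the string metadata lives under the `__metadata__` key. A stdlib‑only checker (file layout per the safetensors format; no framework needed):

```python
import json
import struct


def read_safetensors_metadata(path: str) -> dict:
    """Return the __metadata__ dict from a safetensors file (empty if absent)."""
    with open(path, "rb") as f:
        header_len = struct.unpack("<Q", f.read(8))[0]  # u64, little-endian
        header = json.loads(f.read(header_len))
    return header.get("__metadata__", {})


# After sanitizing, this should return an empty dict:
# read_safetensors_metadata("/workspace/work/sanitized_model/model.safetensors")
```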
Now point Ollama at the sanitized model file – it loads without “unexpected EOF” or “Tensor not found” errors.
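For the Ollama step itself, a minimal `Modelfile` is enough. A sketch (the model name `my-finetune` is arbitrary, and the path must match your own sanitized output directory):

```shell
# Minimal Ollama Modelfile pointing at the sanitized weights
cat > Modelfile <<'EOF'
FROM /workspace/work/sanitized_model
EOF

# Register the model with Ollama and run a quick smoke test
ollama create my-finetune -f Modelfile
ollama run my-finetune "Say hello"
```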
5. Hooking Everything Together
- Run the fine‑tune inside the Unsloth Docker image (with the two guardrails from §1).
- Merge using the VRAM‑insurance call (§3).
- Sanitize the resulting safetensors (§4).
- Load the cleaned model into Ollama (best format: `merged_4bit_forced`).
- Configure OpenClaw to call the local Ollama endpoint.
- When a visual task appears, OpenClaw triggers your ComfyUI instance.
Because the model respects the 1024‑token context window, inference latency stays very low on the dual‑GPU rig.
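The wiring above can be sketched as a tiny dispatcher. Everything here is a hypothetical illustration (OpenClaw's real configuration is out of scope); only the two port numbers are the tools' actual defaults:

```python
# Hypothetical dispatcher: text tasks go to the local Ollama endpoint,
# visual tasks go to ComfyUI. The URLs use each tool's default local port.
OLLAMA_URL = "http://localhost:11434"   # Ollama's default port
COMFYUI_URL = "http://localhost:8188"   # ComfyUI's default port


def route(task_type: str) -> str:
    """Pick a backend endpoint based on a (hypothetical) task type."""
    return COMFYUI_URL if task_type == "visual" else OLLAMA_URL


print(route("chat"))    # → http://localhost:11434
print(route("visual"))  # → http://localhost:8188
```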
6. “Gatekeeper” Errors & Their Fixes
| Error / Symptom | Likely Cause | “Hardware‑Aware” Fix |
|---|---|---|
| `CUDA Out of Memory` (OOM) during long training runs | VRAM fragmentation inside the Docker container | Set `os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"` before model init |
| `AttributeError: 'int' object has no attribute 'mean'` | Multi‑GPU synchronization conflict in Unsloth/Hugging Face | Pass `average_tokens_across_devices=False` to `TrainingArguments` |
| `ollama create`: `unexpected EOF` or `Tensor not found` | Unsanitized U8/U9 metadata headers in the safetensors file | Run the Header Stripper script (see §4) |
| System freeze during `save_pretrained_merged` | Attempting to load base model and LoRA into VRAM simultaneously | Call `model.save_pretrained_merged(..., maximum_memory_usage=0.4, save_method="merged_4bit_forced")` |
| Docker container crashes when both GPUs are visible | Docker defaults to a single‑GPU memory pool | Launch Docker with `--gpus all` and ensure the environment variable from §1 is set inside the container |
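The last row can be made concrete with a launch command along these lines (the image name is a placeholder; substitute whatever Unsloth image and volume layout you actually use):

```shell
# Expose both GPUs and set the allocator guardrail inside the container.
# Replace <your-unsloth-image> with the image you use for fine-tuning.
docker run --gpus all \
  -e PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
  -v "$PWD/work:/workspace/work" \
  -it <your-unsloth-image> bash
```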
7. Quick Recap (one‑liner checklist)
1️⃣ `export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"`
2️⃣ `average_tokens_across_devices = False`
3️⃣ `max_seq_length = 1024`
4️⃣ `per_device_train_batch_size = 1`
5️⃣ `gradient_accumulation_steps = 8`
6️⃣ Use `DataCollatorForLanguageModeling`
7️⃣ `model.save_pretrained_merged(..., maximum_memory_usage=0.4, save_method="merged_4bit_forced")`
8️⃣ Run the `sanitize_metadata()` script on the output folder
9️⃣ Load the cleaned model into Ollama
🔟 Wire Ollama → OpenClaw → (optional) ComfyUI
Follow these steps, and your dual‑RTX 3060 rig will stay zero‑crash, fast, and ready for the next AI experiment. Happy fine‑tuning!
AI on a multi‑GPU rig isn’t about having the fastest hardware; it’s about being the best mechanic. By controlling your memory allocation, capping your context, and “washing” your metadata, you can turn consumer graphics cards into a highly capable, private, agentic laboratory.