The Homelab AI Stack in 2026: What Self-Hosters Are Actually Running
Source: Dev.to
Spend five minutes on r/selfhosted and you’ll notice: the conversations have changed.
Two years ago everyone asked “what should I run?” Now they’re sharing sophisticated stacks that rival small‑business infrastructure. The self‑hosting AI movement has matured. Here’s what’s actually worth deploying in 2026.
The Core Stack (What Stayed)
Ollama — Local LLM Runtime
Ollama won. It beat LocalAI on simplicity, beat llama.cpp on UX, and the model library makes pulling new models trivial.
```bash
# Install
curl -fsSL https://ollama.ai/install.sh | sh

# Pull the best-value model for 16 GB RAM
ollama pull qwen2.5:14b

# Or for 24 GB+ (M4 Mac mini, high-RAM PC)
ollama pull qwen2.5:32b

# Test immediately
ollama run qwen2.5:14b "Explain what makes a good Docker Compose file"
```
Hardware reality check
| RAM | Practical model size | Typical use |
|---|---|---|
| 8 GB | 7 B | Basic tasks |
| 16 GB | 14 B | Solid capability |
| 24 GB (M4 Mac mini sweet spot) | 32 B | Near GPT‑4 quality |
| 32 GB+ | 70 B | Excellent for everything |
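The RAM pairings in the table follow from simple quantization arithmetic: a 4-bit-quantized model needs roughly 0.5 GB per billion parameters for weights, plus a couple of gigabytes for the KV cache and runtime. A minimal sketch (the 0.5 GB/B figure and the 2 GB overhead are rough assumptions, not exact numbers):

```python
def q4_model_gb(params_billion: float, overhead_gb: float = 2.0) -> float:
    """Rough RAM needed for a 4-bit-quantized model: ~0.5 GB per billion
    parameters for weights, plus KV cache / runtime overhead (assumed 2 GB)."""
    return params_billion * 0.5 + overhead_gb

# 7B  -> ~5.5 GB  (fits in 8 GB)
# 14B -> ~9 GB    (comfortable on 16 GB)
# 32B -> ~18 GB   (fits in 24 GB unified memory)
# 70B -> ~37 GB   (needs 32 GB+ with swap headroom, or 48 GB to be safe)
```

Longer contexts inflate the KV-cache term, which is why the table leaves headroom rather than packing models to the byte.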
Open WebUI — The Interface
Deploys in ~2 minutes and gives you a ChatGPT‑equivalent UI locally.
```yaml
# docker-compose.yml
services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    volumes:
      - open-webui:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://host.docker.internal:11434
    ports:
      - "3000:8080"
    extra_hosts:
      - "host.docker.internal:host-gateway"
    restart: unless-stopped

volumes:
  open-webui:
```
n8n — Automation Brain
For connecting AI to everything else. Self‑hosted, no per‑workflow limits, full control.
Killer use case in 2026: n8n + Ollama = private AI automations that cost $0/month to run.
My actual running workflows:
- Gmail → Ollama triage → priority flag → Telegram alert
- RSS feeds → Ollama summary → daily digest at 7 am
- Server logs → Ollama anomaly check → alert if weird
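Under the hood, each of these workflows reduces to one HTTP call against Ollama's `/api/generate` endpoint plus some string handling. A minimal sketch of the email-triage step in plain Python, assuming a local Ollama on its default port; the helper names and the one-word-priority prompt are my own, not a fixed n8n convention:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint

def build_triage_prompt(subject: str, body: str) -> str:
    # Ask for exactly one word so parsing stays trivial.
    return (
        "Classify this email's priority as exactly one word: "
        "HIGH, NORMAL, or LOW.\n\n"
        f"Subject: {subject}\n\n{body}"
    )

def parse_priority(reply: str) -> str:
    # Be forgiving: take the first recognized keyword, default to NORMAL.
    upper = reply.upper()
    for level in ("HIGH", "LOW", "NORMAL"):
        if level in upper:
            return level
    return "NORMAL"

def triage(subject: str, body: str, model: str = "qwen2.5:14b") -> str:
    payload = json.dumps({
        "model": model,
        "prompt": build_triage_prompt(subject, body),
        "stream": False,  # return one JSON object instead of a token stream
    }).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return parse_priority(json.loads(resp.read())["response"])
```

In n8n the same shape appears as an HTTP Request node feeding an IF node; the point is that the whole "AI" step is one request and one keyword check.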
What Got Replaced in 2026
| Replaced | Replaced by |
|---|---|
| LocalAI | Ollama |
| Flowise | n8n |
| Custom Python scripts | n8n workflows |
Why? Ollama is more feature‑complete, n8n handles AI and everything else, and n8n workflows are inspectable, editable, and debuggable without touching code.
What Got Added in 2026
Whisper.cpp — Local Audio Transcription
```bash
brew install whisper-cpp  # or build from source for max performance

# Transcribe any audio file
whisper-cpp --model base.en audio.mp3
```
Use cases: meeting transcription, voice‑notes → text, local podcast search.
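For the voice-notes use case, it's easy to batch the CLI over a folder with a few lines of Python. A sketch, assuming the `whisper-cpp` binary from the brew formula above is on your PATH (the helper names are mine, and binary names vary between builds):

```python
import subprocess
from pathlib import Path

def whisper_cmd(audio: Path, model: str = "base.en") -> list[str]:
    # Mirrors the CLI invocation shown above; binary name may differ per install.
    return ["whisper-cpp", "--model", model, str(audio)]

def transcribe_folder(folder: str) -> None:
    # Transcribe every MP3 in a folder, one subprocess per file.
    for audio in sorted(Path(folder).glob("*.mp3")):
        subprocess.run(whisper_cmd(audio), check=True)
```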
LiteLLM — The Unified Proxy
LiteLLM sits in front of all your AI models and presents a single OpenAI‑compatible API endpoint.
```yaml
# docker-compose.yml (excerpt)
services:
  litellm:
    image: ghcr.io/berriai/litellm:main-latest
    ports:
      - "4000:4000"
    environment:
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - OPENAI_API_KEY=${OPENAI_API_KEY}
    volumes:
      - ./litellm_config.yaml:/app/config.yaml
```
Now every app in your stack — n8n, Open WebUI, your scripts — points to http://litellm:4000 and you switch models by editing a single config file.
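From a client's perspective that endpoint speaks the standard OpenAI chat-completions protocol. A minimal sketch in plain Python, assuming the `litellm` hostname and port from the compose excerpt above (the helper names and the placeholder API key are mine; LiteLLM only enforces a key if you configure one):

```python
import json
import urllib.request

LITELLM_URL = "http://litellm:4000/v1/chat/completions"  # from the compose file above

def chat_payload(model: str, user_message: str) -> dict:
    # Standard OpenAI chat-completions body; LiteLLM routes on the model name.
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }

def ask(model: str, message: str) -> str:
    req = urllib.request.Request(
        LITELLM_URL,
        data=json.dumps(chat_payload(model, message)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-anything",  # only checked if a master key is set
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]

# Swapping "qwen2.5:14b" for a hosted model needs no code change here,
# only an entry in litellm_config.yaml.
```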
ChromaDB + LlamaIndex — Private RAG
Search your own documents with AI. All local, all private.
```python
import chromadb
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.chroma import ChromaVectorStore

# Index your documents
docs = SimpleDirectoryReader('/your/docs/folder').load_data()
db = chromadb.PersistentClient(path='./chroma_db')
collection = db.get_or_create_collection('my_docs')
store = ChromaVectorStore(chroma_collection=collection)
storage_context = StorageContext.from_defaults(vector_store=store)

# Query them
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)
engine = index.as_query_engine()
response = engine.query('What did we decide about the API architecture?')
print(response)
```
The Hardware Question
GPU server vs. Apple Silicon?
In 2026, for pure AI inference at homelab scale, Apple Silicon wins on value.
| Device | Typical performance | Pros | Cons |
|---|---|---|---|
| M4 Mac mini (24 GB, ~$800) | 32 B models @ 10‑15 tokens / sec | Silent, 30 W idle, no separate GPU, macOS = easy maintenance | Limited to Apple ecosystem |
| NVIDIA RTX 4090 server (24 GB VRAM) | Faster on large batches, better for fine‑tuning | Superior raw throughput, good for training | Loud, 450 W under load, Linux‑only, higher cost |
- Homelab with 1‑5 concurrent users (text tasks): Mac mini M4.
- Serious inference throughput or training: GPU server.
The Monitoring Stack
Don’t run AI services without knowing when they break.
- Uptime Kuma – health checks for Ollama, n8n, Open WebUI, etc.
- Netdata – per‑container resource usage.
- Loki + Grafana – aggregate logs from all containers.
```yaml
# Example snippet for log collection (docker-compose)
labels:
  - logging=promtail
  - logging_jobname=containerlogs
```
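For the Ollama health check specifically, Uptime Kuma can poll the `/api/tags` endpoint, which lists installed models. A minimal sketch of the same check as a script, assuming Ollama on its default port (the helper names are mine):

```python
import json
import urllib.request

def installed_models(tags_json: dict) -> list[str]:
    # /api/tags returns {"models": [{"name": "qwen2.5:14b", ...}, ...]}
    return [m["name"] for m in tags_json.get("models", [])]

def check_ollama(base_url: str = "http://localhost:11434") -> list[str]:
    # Raises on connection failure or non-200, so an empty-but-healthy
    # server is distinguishable from a dead one.
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return installed_models(json.loads(resp.read()))
```

Wiring the failure path into n8n gives you the "alert if weird" workflow mentioned earlier.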
What I’d Set Up First on a New Server
In order, if starting from scratch:
- Traefik – reverse proxy + automatic HTTPS (everything else goes behind it).
- Ollama – pull qwen2.5:14b first, add others as needed.
- Open WebUI – UI for chatting with the models.
- n8n – automation workflows.
- LiteLLM – unified API endpoint.
- ChromaDB + LlamaIndex – private RAG.
- Whisper.cpp – local transcription.
- Monitoring stack – Uptime Kuma, Netdata, Loki + Grafana.
That’s the practical, battle‑tested stack many self‑hosters are running in 2026. Happy building!
Immediately Useful Additions
- n8n — automation brain
- LiteLLM — unified API proxy
- Uptime Kuma — monitoring
- Vaultwarden — password manager (you’ll need it)
The One Thing Most People Miss
Running models locally is only half the value.
The other half is connecting them to your actual workflow — your email, your calendar, your codebase, your documents. A local LLM that just answers questions in a chat window is merely a slower, private version of ChatGPT.
A local LLM wired into n8n that automatically triages your email, monitors your servers, and summarizes your notes — that’s actual leverage.
SIGNAL publishes weekly. Follow @signal-weekly for more practical builder content.
Next: How I use AI agents to automate the boring parts of running a homelab — specific n8n workflows, working code.