Run Your Own Local AI Chat with OpenWebUI and llama.cpp - Windows
Source: Dev.to
TL;DR
A local ChatGPT‑like stack using OpenWebUI as the UI and llama.cpp as the inference server, with a GGUF model from Hugging Face. Everything talks over an OpenAI‑compatible API, so there are no API bills and no data leaves your machine.
- Privacy: Prompts and replies stay on your machine.
- No API bills: No usage‑based pricing or quotas.
- Control: Choose the model, quantization, and context size.
- Open source: OpenWebUI and llama.cpp are free and auditable.
System requirements
| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 11 | – |
| RAM | 16 GB | 32 GB (helps for larger models) |
| GPU | optional (recommended for speed) | 8 GB VRAM or more |
| Disk | Enough for multi‑GB model files (4–8 GB per model) | – |
A 7B model in Q4 quantization runs on many machines; larger models need more memory.
Three pieces of the stack
- OpenWebUI – the browser UI (chat, history, model selection).
- llama.cpp server – local inference with an OpenAI‑compatible HTTP API.
- GGUF model – weights you download once and keep on disk.
OpenWebUI talks to the llama‑server over HTTP. No cloud is involved.
Install llama.cpp (pre‑built binaries)
1. Open PowerShell and run `nvidia-smi` to see your CUDA version (e.g., 12.x).
2. Go to the llama.cpp releases page and download the build that matches your CUDA version.
3. Download the CUDA runtime DLL bundle from the Assets (e.g., `cudart-llama-bin-win-cuda-12`).
4. Extract the main archive to a stable folder, e.g. `C:\Program Files\llama.cpp\`.
5. Add that folder to your system PATH (Windows Search → Environment Variables → Path → Edit → New), or use the PowerShell sketch after this list.
6. Extract the CUDA runtime bundle and copy all `.dll` files into the same folder as `llama-server.exe`.
7. Open a new terminal (so the updated PATH is loaded) and verify the install with `llama-server --help`. If you see the help output, the installation succeeded.
Install OpenWebUI (Python virtual environment)
Using venv
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install open-webui
Alternative: Conda
conda create -n local_chat python=3.11 -y
conda activate local_chat
pip install open-webui
Start the UI:
open-webui serve
Open your browser at http://localhost:8080 (or the port shown in the terminal). The UI will be visible; the model connection is configured in the next step.
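If port 8080 is already taken on your machine, recent versions let you pick a different port; check `open-webui serve --help` for the exact option. A hedged example:

```powershell
# Sketch: start OpenWebUI on another port (assumes your open-webui version supports --port).
open-webui serve --port 3000
```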
Download a GGUF model
For this guide we use Qwen2.5‑Coder‑7B‑Instruct‑GGUF (Q4_K_M quantization). On the Hugging Face repository you’ll find several quantizations (Q2–Q8); Q4 offers a good balance of size and quality.
1. Download the `Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf` file.
2. Place it in a permanent folder, e.g. `C:\Users\<your_username>\.llm_models\`. Replace `<your_username>` with your Windows username.
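If you would rather script the download than click through the browser, the Hugging Face CLI can fetch the file directly. A sketch, assuming `huggingface_hub` is installed (`pip install -U "huggingface_hub[cli]"`); the repository id below is an assumption, so copy the exact id and file name from the model page you are using.

```powershell
# Download the Q4_K_M GGUF straight into the models folder.
# Repository id and file name are assumptions; check the model page's Files tab.
huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --local-dir "C:\Users\<your_username>\.llm_models"
```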
Run the llama.cpp server
Choose a port that does not clash with OpenWebUI (which uses 8080). Here we use 10000:
llama-server -m "C:\Users\<your_username>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000
Leave this terminal open. The server now provides an OpenAI‑compatible API at:
http://localhost:10000
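Before wiring up the UI, you can confirm the server answers from a second terminal. A quick check, assuming your llama-server build exposes the usual `/health` and `/v1/models` endpoints:

```powershell
# curl.exe ships with Windows 10/11; both requests should return JSON if the server is up.
curl.exe http://localhost:10000/health
curl.exe http://localhost:10000/v1/models
```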
Connect OpenWebUI to the llama.cpp server
1. In the browser, open OpenWebUI (`http://localhost:8080`).
2. Navigate to Settings → Connections (or Admin → Connections, depending on version).
3. Add a new OpenAI‑compatible connection:
   - Base URL: `http://localhost:10000/v1`
   - API key: leave empty or use a placeholder such as `local` if required.
4. Save the connection, select it as the active model, and send a test message. If the model replies, the stack is working.
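To see roughly what OpenWebUI sends behind the scenes, you can reproduce a chat request by hand against the same endpoint. A minimal PowerShell sketch; the model name is a placeholder (llama-server typically accepts any name when a single model is loaded).

```powershell
# One-off chat completion against the local OpenAI-compatible endpoint.
$body = @{
    model    = "qwen2.5-coder-7b-instruct"   # placeholder; any name usually works with one loaded model
    messages = @(@{ role = "user"; content = "Say hello in one sentence." })
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:10000/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body
```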
Troubleshooting
| Issue | Suggested fix |
|---|---|
| OpenWebUI loads but no model appears | Verify that llama-server is running and that http://localhost:10000 responds (try in a browser or with curl). |
| Connection fails | Use http://127.0.0.1:10000 instead of http://localhost:10000. Check Windows Firewall settings. |
| Performance is slow | Try a smaller model or a lower‑precision quantization (e.g., Q4). Reduce the context length if you increased it. On NVIDIA GPUs, ensure you are using the CUDA build and that the runtime DLLs are in the same folder as the executables. |
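For the performance row in particular, two llama-server options are worth knowing; confirm the exact spelling in `llama-server --help` for your build. A hedged example that offloads layers to the GPU and caps the context:

```powershell
# -ngl: number of layers to offload to the GPU (99 ~ "as many as fit" for a 7B model)
# -c:   context size in tokens; smaller contexts use less memory
llama-server -m "C:\Users\<your_username>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000 -ngl 99 -c 4096
```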
What you get
- A fully local chat stack: OpenWebUI for the UI, llama.cpp for inference, and a GGUF model from Hugging Face.
- No external API calls, no paid subscriptions.
- The only limits are RAM/VRAM and disk space (models are often several GB each).
Next steps
- Experiment with different models (general‑purpose vs. coder).
- Try other quantizations (Q2, Q5, Q8) to balance speed and quality.
- Adjust context length to suit your workload.