Run Your Own Local AI Chat with OpenWebUI and llama.cpp - Windows
Source: Dev.to
TL;DR
A local ChatGPT‑like stack using OpenWebUI as the UI and llama.cpp as the inference server, with a GGUF model from Hugging Face. Everything talks over an OpenAI‑compatible API, so there are no API bills and no data leaves your machine.
- Privacy: Prompts and replies stay on your machine.
- No API bills: No usage‑based pricing or quotas.
- Control: Choose the model, quantization, and context size.
- Open source: OpenWebUI and llama.cpp are free and auditable.
System requirements
| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 11 | – |
| RAM | 16 GB | 32 GB (helps for larger models) |
| GPU | optional (recommended for speed) | 8 GB VRAM or more |
| Disk | Enough for multi‑GB model files (4–8 GB per model) | – |
A 7B model in Q4 quantization runs on many machines; larger models need more memory.
Three pieces of the stack
- OpenWebUI – the browser UI (chat, history, model selection).
- llama.cpp server – local inference with an OpenAI‑compatible HTTP API.
- GGUF model – weights you download once and keep on disk.
OpenWebUI talks to the llama‑server over HTTP. No cloud is involved.
Install llama.cpp (pre‑built binaries)
1. Open PowerShell and run `nvidia-smi` to see your CUDA version (e.g., 12.x).
2. Go to the llama.cpp releases page and download the build that matches your CUDA version.
3. Download the CUDA runtime DLL bundle from the Assets (e.g., `cudart-llama-bin-win-cuda-12`).
4. Extract the main archive to a stable folder, e.g. `C:\Program Files\llama.cpp\`.
5. Add that folder to your system PATH (Windows Search → Environment Variables → Path → Edit → New), or use the PowerShell sketch after this list.
6. Extract the CUDA runtime bundle and copy all `.dll` files into the same folder as `llama-server.exe`.
7. Open a new terminal (so the updated PATH is loaded) and verify the install with `llama-server --help`. If you see the help output, the installation succeeded.
Install OpenWebUI (Python virtual environment)
Using venv
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install open-webui
Alternative: Conda
conda create -n local_chat python=3.11 -y
conda activate local_chat
pip install open-webui
Start the UI:
open-webui serve
Open your browser at http://localhost:8080 (or the port shown in the terminal). The UI will be visible; the model connection is configured in the next step.
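If port 8080 is already taken on your machine, recent versions let you pick a different port; check `open-webui serve --help` for the exact option. A hedged example:

```powershell
# Sketch: start OpenWebUI on another port (assumes your open-webui version supports --port).
open-webui serve --port 3000
```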
Download a GGUF model
For this guide we use Qwen2.5‑Coder‑7B‑Instruct‑GGUF (Q4_K_M quantization). On the Hugging Face repository you’ll find several quantizations (Q2–Q8); Q4 offers a good balance of size and quality.
1. Download the `Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf` file.
2. Place it in a permanent folder, e.g. `C:\Users\<your_username>\.llm_models\`. Replace `<your_username>` with your Windows username.
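If you would rather script the download than click through the browser, the Hugging Face CLI can fetch the file directly. A sketch, assuming `huggingface_hub` is installed (`pip install -U "huggingface_hub[cli]"`); the repository id below is an assumption, so copy the exact id and file name from the model page you are using.

```powershell
# Download the Q4_K_M GGUF straight into the models folder.
# Repository id and file name are assumptions; check the model page's Files tab.
huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf --local-dir "C:\Users\<your_username>\.llm_models"
```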
Run the llama.cpp server
Choose a port that does not clash with OpenWebUI (which uses 8080). Here we use 10000:
llama-server -m "C:\Users\<your_username>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000
Leave this terminal open. The server now provides an OpenAI‑compatible API at:
http://localhost:10000
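Before wiring up the UI, you can confirm the server answers from a second terminal. A quick check, assuming your llama-server build exposes the usual `/health` and `/v1/models` endpoints:

```powershell
# curl.exe ships with Windows 10/11; both requests should return JSON if the server is up.
curl.exe http://localhost:10000/health
curl.exe http://localhost:10000/v1/models
```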
Connect OpenWebUI to the llama.cpp server
1. In the browser, open OpenWebUI (`http://localhost:8080`).
2. Navigate to Settings → Connections (or Admin → Connections, depending on version).
3. Add a new OpenAI‑compatible connection:
   - Base URL: `http://localhost:10000/v1`
   - API key: leave empty or use a placeholder such as `local` if required.
4. Save the connection, select it as the active model, and send a test message. If the model replies, the stack is working.
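To see roughly what OpenWebUI sends behind the scenes, you can reproduce a chat request by hand against the same endpoint. A minimal PowerShell sketch; the model name is a placeholder (llama-server typically accepts any name when a single model is loaded).

```powershell
# One-off chat completion against the local OpenAI-compatible endpoint.
$body = @{
    model    = "qwen2.5-coder-7b-instruct"   # placeholder; any name usually works with one loaded model
    messages = @(@{ role = "user"; content = "Say hello in one sentence." })
} | ConvertTo-Json -Depth 5
Invoke-RestMethod -Uri "http://localhost:10000/v1/chat/completions" -Method Post -ContentType "application/json" -Body $body
```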
Troubleshooting
| Issue | Suggested fix |
|---|---|
| OpenWebUI loads but no model appears | Verify that llama-server is running and that http://localhost:10000 responds (try in a browser or with curl). |
| Connection fails | Use http://127.0.0.1:10000 instead of http://localhost:10000. Check Windows Firewall settings. |
| Performance is slow | Try a smaller model or a lower‑precision quantization (e.g., Q4). Reduce the context length if you increased it. On NVIDIA GPUs, ensure you are using the CUDA build and that the runtime DLLs are in the same folder as the executables. |
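For the performance row in particular, two llama-server options are worth knowing; confirm the exact spelling in `llama-server --help` for your build. A hedged example that offloads layers to the GPU and caps the context:

```powershell
# -ngl: number of layers to offload to the GPU (99 ~ "as many as fit" for a 7B model)
# -c:   context size in tokens; smaller contexts use less memory
llama-server -m "C:\Users\<your_username>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000 -ngl 99 -c 4096
```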
What you get
- A fully local chat stack: OpenWebUI for the UI, llama.cpp for inference, and a GGUF model from Hugging Face.
- No external API calls, no paid subscriptions.
- The only limits are RAM/VRAM and disk space (models are often several GB each).
Next steps
- Experiment with different models (general‑purpose vs. coder).
- Try other quantizations (Q2, Q5, Q8) to balance speed and quality.
- Adjust context length to suit your workload.