Run Your Own Local AI Chat with OpenWebUI and llama.cpp - Windows

Published: February 28, 2026 at 03:16 PM EST
4 min read
Source: Dev.to

TL;DR

A local ChatGPT‑like stack using OpenWebUI as the UI and llama.cpp as the inference server, with a GGUF model from Hugging Face. Everything talks over an OpenAI‑compatible API, so no API bills and no data leaves your machine.

  • Privacy: Prompts and replies stay on your machine.
  • No API bills: No usage‑based pricing or quotas.
  • Control: Choose the model, quantization, and context size.
  • Open source: OpenWebUI and llama.cpp are free and auditable.

System requirements

| Component | Minimum | Recommended |
|---|---|---|
| OS | Windows 11 | |
| RAM | 16 GB | 32 GB (helps for larger models) |
| GPU | Optional (recommended for speed) | 8 GB VRAM or more |
| Disk | Enough for multi‑GB model files (4–8 GB per model) | |

A 7B model in Q4 quantization runs on many machines; larger models need more memory.
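As a rough rule of thumb, a quantized model's file size is about parameter count × bits per weight ÷ 8, plus some overhead. A minimal sketch of that estimate (the ~4.5 bits-per-weight figure for Q4_K_M is an approximation, not an exact spec):

```python
def approx_gguf_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough GGUF file size: parameters times bits per weight, converted to GB."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at ~4.5 bits/weight (roughly Q4_K_M) lands near 4 GB,
# consistent with the 4-8 GB per model noted above.
print(round(approx_gguf_size_gb(7, 4.5), 1))  # -> 3.9
```

Keep in mind you also need free RAM (or VRAM) on that order to load the model, on top of the disk space.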

Three pieces of the stack

  1. OpenWebUI – the browser UI (chat, history, model selection).
  2. llama.cpp server – local inference with an OpenAI‑compatible HTTP API.
  3. GGUF model – weights you download once and keep on disk.

OpenWebUI talks to the llama‑server over HTTP. No cloud is involved.

Install llama.cpp (pre‑built binaries)

  1. Open PowerShell and run nvidia-smi to see your CUDA version (e.g., 12.x).

  2. Go to the llama.cpp releases page and download the build that matches your CUDA version.

  3. Download the CUDA runtime DLL bundle from the Assets (e.g., cudart-llama-bin-win-cuda-12).

  4. Extract the main archive to a stable folder, e.g.:

    C:\Program Files\llama.cpp\
  5. Add that folder to your system PATH
    (Windows Search → Environment Variables → Path → Edit → New).

  6. Extract the CUDA runtime bundle and copy all .dll files into the same folder as llama-server.exe.

  7. Open a new terminal (so the updated PATH is loaded) and verify the install:

    llama-server --help

    If you see the help output, the installation succeeded.

Install OpenWebUI (Python virtual environment)

Using venv

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install open-webui

Alternative: Conda

conda create -n local_chat python=3.11 -y
conda activate local_chat
pip install open-webui

Start the UI:

open-webui serve

Open your browser at http://localhost:8080 (or the port shown in the terminal). The UI will be visible; the model connection is configured in the next step.

Download a GGUF model

For this guide we use Qwen2.5‑Coder‑7B‑Instruct‑GGUF (Q4_K_M quantization). On the Hugging Face repository you’ll find several quantizations (Q2–Q8); Q4 offers a good balance of size and quality.

  1. Download the Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf file.

  2. Place it in a permanent folder, e.g.:

    C:\Users\<your_username>\.llm_models\

    Replace <your_username> with your Windows username.

Run the llama.cpp server

Choose a port that does not clash with OpenWebUI (which uses 8080). Here we use 10000:

llama-server -m "C:\Users\<your_username>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" --port 10000
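Beyond `--port`, two flags are worth knowing: `-c`/`--ctx-size` sets the context window and `-ngl`/`--n-gpu-layers` offloads layers to the GPU. A sketch with illustrative values; check `llama-server --help` on your build, since flags can change between releases:

```shell
# Offload 32 layers to the GPU and use an 8192-token context window.
# Tune both values to your VRAM and workload.
llama-server -m "C:\Users\<your_username>\.llm_models\Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" `
  --port 10000 `
  -c 8192 `
  -ngl 32
```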

Leave this terminal open. The server now provides an OpenAI‑compatible API at:

http://localhost:10000
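The OpenAI‑compatible endpoints live under `/v1`. A minimal Python sketch of a chat request against that API (the `"local"` model name is a placeholder; llama.cpp serves whichever model it was started with and typically ignores the name):

```python
import json
import urllib.request

BASE_URL = "http://localhost:10000/v1"

def build_chat_request(prompt: str, base_url: str = BASE_URL):
    """Build the URL and JSON body for an OpenAI-style chat completion."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": "local",  # placeholder; the server uses the loaded GGUF
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(body).encode("utf-8")

def chat(prompt: str) -> str:
    """POST the request and return the reply (requires the server to be running)."""
    url, data = build_chat_request(prompt)
    req = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

This is the same request shape OpenWebUI sends on your behalf once the connection is configured.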

Connect OpenWebUI to the llama.cpp server

  1. In the browser, open OpenWebUI (http://localhost:8080).

  2. Navigate to Settings → Connections (or Admin → Connections, depending on version).

  3. Add a new OpenAI‑compatible connection:

    • Base URL: http://localhost:10000/v1
    • API key: leave empty or use a placeholder such as local if required.
  4. Save the connection, select it as the active model, and send a test message. If the model replies, the stack is working.
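Behind the scenes, OpenWebUI discovers models by querying the server's `/v1/models` endpoint. A small sketch of how the model ids are pulled out of that response (the sample payload below is illustrative, not captured from a real server):

```python
import json

def model_ids(models_response: str) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models JSON response."""
    return [m["id"] for m in json.loads(models_response)["data"]]

# Illustrative response shape; llama.cpp reports the loaded GGUF here.
sample = (
    '{"object": "list", "data": '
    '[{"id": "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf", "object": "model"}]}'
)
print(model_ids(sample))
```

If this endpoint returns an empty `data` list, OpenWebUI will show no models even though the connection itself succeeded.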

Troubleshooting

| Issue | Suggested fix |
|---|---|
| OpenWebUI loads but no model appears | Verify that llama-server is running and that http://localhost:10000 responds (try in a browser or with curl). |
| Connection fails | Use http://127.0.0.1:10000 instead of http://localhost:10000. Check Windows Firewall settings. |
| Performance is slow | Try a smaller model or a lower‑precision quantization (e.g., Q4). Reduce the context length if you increased it. On NVIDIA GPUs, ensure you are using the CUDA build and that the runtime DLLs are in the same folder as the executables. |
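For the first two issues, a quick way to tell "server down" apart from "wrong URL" is a raw TCP check. A small hypothetical helper (not part of either tool):

```python
import socket

def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Check both halves of the stack.
for name, port in [("OpenWebUI", 8080), ("llama-server", 10000)]:
    state = "up" if port_open("127.0.0.1", port) else "down"
    print(f"{name} ({port}): {state}")
```

If the port is open but OpenWebUI still fails, the problem is in the connection settings (for example a missing `/v1` suffix in the Base URL) rather than the server.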

What you get

  • A fully local chat stack: OpenWebUI for the UI, llama.cpp for inference, and a GGUF model from Hugging Face.
  • No external API calls, no paid subscriptions.
  • The only limits are RAM/VRAM and disk space (models are often several GB each).

Next steps

  • Experiment with different models (general‑purpose vs. coder).
  • Try other quantizations (Q2, Q5, Q8) to balance speed and quality.
  • Adjust context length to suit your workload.