Running Claude Code with Local LLMs via vLLM and LiteLLM
Source: Dev.to
Goal: Keep proprietary source code on‑premises while still using Claude Code’s workflow.
Solution: Proxy Claude Code’s Anthropic Messages API calls through LiteLLM, which translates them to the OpenAI‑compatible API spoken by a local vLLM inference server.
Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Local GPU
1. One‑line environment variable
```bash
export ANTHROPIC_BASE_URL="http://localhost:4000"
```
Claude Code now points at the LiteLLM proxy instead of Anthropic’s cloud endpoint.
2. Model & Hardware
- Model: Qwen3‑Coder‑30B‑A3B‑Instruct‑AWQ (Mixture‑of‑Experts, 30 B total parameters, 3 B active per forward pass)
- GPUs: Dual AMD MI60 (ROCm) using tensor parallelism (size = 2)
- Quantisation: AWQ → fits comfortably in GPU memory
2.1 vLLM Docker service (docker‑compose snippet)
```yaml
services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    shm_size: 16g
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder
```
Why the last two flags?
- `--enable-auto-tool-choice` – lets the model decide when to emit a tool call.
- `--tool-call-parser qwen3_coder` – converts Qwen’s XML‑style tool calls into the OpenAI tool‑call format expected by LiteLLM (and ultimately by Claude Code).
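To make the parser’s job concrete, here is an illustrative sketch of the translation it performs. The tag names follow Qwen3‑Coder’s `<tool_call>`/`<function=…>`/`<parameter=…>` convention; the real vLLM parser is more robust (streaming, malformed input, typed arguments), so treat this as an approximation, not the actual implementation:

```python
import json
import re

def parse_qwen_tool_call(text):
    """Convert a Qwen3-Coder XML-style tool call into an
    OpenAI-format tool_calls entry (illustrative sketch only)."""
    match = re.search(
        r"<tool_call>\s*<function=(\w+)>(.*?)</function>\s*</tool_call>",
        text, re.DOTALL)
    if match is None:
        return None  # plain text, no tool call present
    name, body = match.group(1), match.group(2)
    # Each argument arrives as <parameter=key>value</parameter>
    args = dict(re.findall(r"<parameter=(\w+)>(.*?)</parameter>", body, re.DOTALL))
    return {
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(args)},
    }

raw = ("<tool_call><function=write_file>"
       "<parameter=path>app.py</parameter>"
       "<parameter=content>print('hi')</parameter>"
       "</function></tool_call>")
call = parse_qwen_tool_call(raw)
```

Without this translation step, LiteLLM would pass the XML through as plain assistant text and Claude Code would never see a structured tool call.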
3. LiteLLM Configuration
```yaml
model_list:
  - model_name: claude-*
    litellm_params:
      model: hosted_vllm/QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      api_base: http://vllm:8000/v1
      api_key: "not-needed"
    model_info:
      max_tokens: 65536
      max_input_tokens: 57344
      max_output_tokens: 8192

litellm_settings:
  drop_params: true        # ignore Anthropic-only params
  request_timeout: 600
  modify_params: true      # adapt params for OpenAI API

general_settings:
  disable_key_check: true  # no API key needed locally
```
Settings explained
| Setting | Effect |
|---|---|
| `drop_params: true` | Silently drops Anthropic‑specific fields that have no OpenAI counterpart. |
| `modify_params: true` | Allows LiteLLM to rewrite parameters (e.g., `max_tokens`) to match the target API’s expectations. |
| `disable_key_check: true` | Skips API‑key validation – useful when the server runs without authentication. |
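Conceptually, the two parameter settings amount to a small translation step before the request reaches the OpenAI‑compatible backend. The sketch below is a hypothetical simplification (the supported‑field set and clamping logic are illustrative; LiteLLM’s real handling is far more involved):

```python
# Hypothetical sketch of what drop_params / modify_params do to an
# Anthropic-style request before it reaches the OpenAI-compatible backend.
OPENAI_SUPPORTED = {"model", "messages", "max_tokens", "temperature",
                    "top_p", "stream", "stop", "tools", "tool_choice"}

def translate_request(params, max_output_tokens=8192):
    # drop_params: silently discard fields the target API doesn't know
    out = {k: v for k, v in params.items() if k in OPENAI_SUPPORTED}
    # modify_params: clamp values to the backend's advertised limits
    if "max_tokens" in out:
        out["max_tokens"] = min(out["max_tokens"], max_output_tokens)
    return out

req = {"model": "claude-sonnet", "messages": [], "max_tokens": 32000,
       "thinking": {"type": "enabled"}}  # "thinking" is Anthropic-only
clean = translate_request(req)
```

Without both settings, Claude Code’s Anthropic‑flavoured requests would trigger validation errors on the vLLM side instead of being quietly adapted.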
4. Running Claude Code with the local stack
```bash
export ANTHROPIC_BASE_URL="http://localhost:4000"
cd my-project
claude   # launches Claude Code as usual
```
Result: Identical user experience to the hosted Anthropic API, but all inference happens on‑prem.
Performance & Limits
| Metric | Observation |
|---|---|
| Throughput | ~25‑30 tokens / s on dual MI60, ~175 ms time‑to‑first‑token |
| Context window | Capped at 64 K tokens (Claude Opus can go to 200 K) |
| Model capability | Qwen3‑Coder excels at coding; Claude has broader general knowledge and instruction following. |
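A rough feel for what the 64 K window buys: the LiteLLM config reserves 57,344 tokens for input, and ~4 characters per token is a common ballpark for code. The heuristic, overhead figure, and file sizes below are all illustrative assumptions, not measurements:

```python
# Back-of-envelope fit check against the 64K context window.
# 57,344 input tokens per the LiteLLM config; ~4 chars/token is a
# rough heuristic for code. All numbers here are illustrative.
MAX_INPUT_TOKENS = 57_344
CHARS_PER_TOKEN = 4

def fits_in_context(file_sizes_bytes, prompt_overhead_tokens=2_000):
    est_tokens = sum(file_sizes_bytes) // CHARS_PER_TOKEN + prompt_overhead_tokens
    return est_tokens, est_tokens <= MAX_INPUT_TOKENS

# A small Flask app (~20 KB of source) fits easily...
tokens, ok = fits_in_context([20_000])
# ...while a 1 MB codebase blows the budget in a single turn.
big_tokens, big_ok = fits_in_context([1_000_000])
```

This is why the limit rarely bites on scoped tasks but becomes the binding constraint on large refactors.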
Upsides
- Zero API cost
- Full data sovereignty (code never leaves your network)
- Works in air‑gapped environments
5. End‑to‑End Test: Building a Flask Todo App
```bash
export ANTHROPIC_BASE_URL="http://localhost:4000"
cd /tmp && mkdir flask-test && cd flask-test
claude --dangerously-skip-permissions -p \
  "Build a Flask todo app with SQLite persistence, \
  modern UI with gradients and animations, \
  mobile responsive design, and full CRUD operations."
```
Generated project structure
```text
flask_todo_app/
├── app.py              # Flask routes + SQLite setup
├── requirements.txt    # Dependencies
├── run_app.sh          # Launch script
├── static/
│   ├── css/
│   │   └── style.css   # Gradients, animations, hover effects
│   └── js/
│       └── script.js   # Client-side interactions
└── templates/
    └── index.html      # Jinja2 template (responsive layout)
```
Sample app.py
```python
from flask import Flask, render_template, request, redirect, url_for
import sqlite3

app = Flask(__name__)

def init_db():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS todos
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  task TEXT NOT NULL,
                  completed BOOLEAN DEFAULT FALSE)''')
    conn.commit()
    conn.close()

init_db()

@app.route('/')
def index():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('SELECT id, task, completed FROM todos ORDER BY id DESC')
    todos = c.fetchall()
    conn.close()
    return render_template('index.html', todos=todos)
```
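The remaining CRUD routes boil down to three SQL statements. A minimal sketch of the underlying operations, with hypothetical helper names (the generated app inlines these in its Flask route handlers; a connection parameter is used here so the logic is easy to exercise on its own):

```python
import sqlite3

# Hypothetical helpers mirroring the add/toggle/delete routes the
# generated app implements; they take an open connection so the
# DB logic can be exercised independently of Flask.
def add_todo(conn, task):
    cur = conn.execute("INSERT INTO todos (task) VALUES (?)", (task,))
    conn.commit()
    return cur.lastrowid

def toggle_todo(conn, todo_id):
    # SQLite stores booleans as 0/1, so NOT flips completion state
    conn.execute("UPDATE todos SET completed = NOT completed WHERE id = ?",
                 (todo_id,))
    conn.commit()

def delete_todo(conn, todo_id):
    conn.execute("DELETE FROM todos WHERE id = ?", (todo_id,))
    conn.commit()
```

Parameterized queries (`?` placeholders) keep the generated app safe from SQL injection even though it never validates form input itself.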
Sample CSS (static/css/style.css)
```css
body {
  font-family: 'Poppins', sans-serif;
  background: linear-gradient(135deg, #667eea, #764ba2);
  min-height: 100vh;
  padding: 20px;
}

.container {
  max-width: 800px;
  margin: 0 auto;
}

.header {
  text-align: center;
  padding: 40px 0;
  color: white;
  text-shadow: 0 2px 4px rgba(0,0,0,0.1);
}
```
Running the app
```bash
cd flask_todo_app
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python app.py   # starts Flask on http://localhost:5000
```
You can now add, toggle, and delete tasks; the SQLite DB persists across restarts.
6. TL;DR
- Set `ANTHROPIC_BASE_URL` to point at a LiteLLM proxy (`http://localhost:4000`).
- Run vLLM (Docker) with the Qwen3‑Coder model and the `--enable-auto-tool-choice` + `--tool-call-parser qwen3_coder` flags.
- Configure LiteLLM to map any `claude-*` model name to the local vLLM endpoint.
- Launch Claude Code as usual – it now talks to your on‑prem GPU, giving you zero‑cost, sovereign inference while preserving the full Claude Code workflow (including tool calls).
Enjoy secure, high‑performance coding assistance without ever sending your proprietary code off‑site!
Generation Overview
- The generation took about five minutes across multiple agentic iterations.
- Each file is a separate tool call: the model generates, Claude Code executes, the result returns, and the model plans the next step.
- A 91 % prefix‑cache‑hit rate shows vLLM efficiently reusing context across the multi‑turn loop.
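The high hit rate falls out of the loop’s structure: each agentic turn resends the entire conversation, so everything except the newest tokens is a prefix vLLM has already processed. A toy calculation (turn lengths are invented for illustration, not measured from this run):

```python
# Toy model of why agentic loops see high prefix-cache hit rates:
# every request replays the full conversation, so only the tokens
# appended since the last turn miss the cache.
def prefix_cache_hit_rate(new_tokens_per_turn):
    processed = cached = prompt_len = 0
    for new in new_tokens_per_turn:
        prompt_len += new           # this turn's prompt = all prior + new tokens
        cached += prompt_len - new  # everything but the new suffix is cached
        processed += prompt_len
    return cached / processed

# One big initial prompt, then ten short tool-result turns (made-up sizes)
rate = prefix_cache_hit_rate([4000] + [300] * 10)
```

The more turns the loop runs, the closer the rate climbs toward 1.0, which is consistent with the 91 % observed here.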
This confirms the agentic workflow functions correctly. The model reads the prompt, plans a file structure, emits tool calls to create directories and write files, and produces a functional application. All inference happens locally on the MI60s—no code leaves my network.
Limitations & Future Work
- Scale – Not tested on larger codebases. A small Flask app is one thing; a multi‑thousand‑line refactor is another.
- Context window – The 64 K token limit will eventually become a constraint, and the model may struggle with complex architectural decisions that the real Claude handles gracefully.
- Current suitability – Works well for focused, scoped tasks.
Claude Code Compatibility Checklist
| Requirement | Details |
|---|---|
| Strong tool use | The model must emit structured tool calls reliably |
| Code focus | Qwen3‑Coder works well; DeepSeek Coder and CodeLlama variants should also be viable |
| Sufficient context | I used 64 K; smaller windows may work but are untested |
Observations
- Qwen3‑Coder‑30B‑A3B handles straightforward coding tasks well.
- For complex refactoring or architectural decisions, the real Claude API remains the better choice.
Hardware Tips
- If you don’t have 64 GB of VRAM, smaller models like Qwen2.5‑Coder‑7B or Qwen3‑8B should fit on a single 16 GB or 24 GB card.
- I haven’t tested these configurations, so I can’t comment on their context limits or how well they handle Claude Code’s agentic workflows.
Workflow Advice
- Instead of broad “refactor this module” prompts, break work into tighter, more focused requests.
- More prompts of narrower scope play to a smaller model’s strengths.
Full Docker‑Compose Configuration
```yaml
services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    group_add:
      - "44"
      - "992"
    shm_size: 16g
    volumes:
      - /mnt/cache/huggingface:/root/.cache/huggingface:rw
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder

  litellm:
    image: litellm/litellm:v1.80.15-stable
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    command:
      - --config
      - /app/config.yaml
      - --port
      - "4000"
      - --host
      - "0.0.0.0"
    depends_on:
      - vllm
```
Running the Stack
```bash
# Start (using nerdctl or Docker)
nerdctl compose -f coder.yaml up -d
```

From any machine on the network, point Claude Code at Feynman (the GPU workstation) and get local inference.

```bash
# When finished, tear it down
nerdctl compose -f coder.yaml down
```
Final Thoughts
- This setup won’t replace the Claude API for everyone.
- If you need maximum capability, Anthropic’s hosted models remain the best option.
- For those who care about data sovereignty, local inference means proprietary code never leaves the network.
- There’s also something satisfying about watching your own GPUs light up every time you ask Claude Code a question.