Running Claude Code with Local LLMs via vLLM and LiteLLM

Published: February 4, 2026 at 09:12 PM EST
6 min read
Source: Dev.to

Running Claude Code Locally with vLLM + LiteLLM

Goal: Keep proprietary source code on‑premises while still using Claude Code’s workflow.
Solution: Proxy Claude Code’s Anthropic Messages API calls through LiteLLM, which translates them to the OpenAI‑compatible API spoken by a local vLLM inference server.

Claude Code → LiteLLM (port 4000) → vLLM (port 8000) → Local GPU
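LiteLLM's translation step can be pictured as a simple mapping. Here is a toy sketch (not LiteLLM's actual code, and the model name is illustrative) of how a minimal Anthropic Messages request body becomes the OpenAI chat-completions shape that vLLM expects:

```python
def anthropic_to_openai(body: dict) -> dict:
    """Toy illustration of the translation LiteLLM performs.

    Anthropic's Messages API carries the system prompt in a top-level
    "system" field; OpenAI's chat API expects it as the first message.
    """
    messages = []
    if "system" in body:
        messages.append({"role": "system", "content": body["system"]})
    messages.extend(body["messages"])
    return {
        "model": body["model"],  # LiteLLM remaps claude-* to the local model
        "messages": messages,
        "max_tokens": body.get("max_tokens", 8192),
    }

request = {
    "model": "claude-sonnet",  # illustrative name
    "system": "You are a coding assistant.",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 1024,
}
print(anthropic_to_openai(request)["messages"][0]["role"])  # → system
```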

1. One‑line environment variable

export ANTHROPIC_BASE_URL="http://localhost:4000"

Claude Code now points at the LiteLLM proxy instead of Anthropic’s cloud endpoint.

2. Model & Hardware

  • Model: Qwen3‑Coder‑30B‑A3B‑Instruct‑AWQ (Mixture‑of‑Experts, 30 B parameters total, 3 B active per forward pass)
  • GPUs: Dual AMD MI60 (ROCm) using tensor parallelism (size = 2)
  • Quantisation: AWQ 4‑bit → the weights fit comfortably in GPU memory
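A back-of-envelope check (my own arithmetic, not a vLLM measurement) shows why the AWQ fit is comfortable: 30 B parameters at 4 bits is roughly 15 GB of weights, split across two 32 GB MI60s by tensor parallelism, leaving most of each card for the KV cache:

```python
params_b = 30          # total parameters, billions
bits_per_param = 4     # AWQ 4-bit quantisation
gpus = 2               # tensor-parallel size
vram_per_gpu_gb = 32   # AMD MI60 HBM2

weights_gb = params_b * bits_per_param / 8        # ~15 GB of weights total
per_gpu_gb = weights_gb / gpus                    # ~7.5 GB per card
headroom_gb = vram_per_gpu_gb * 0.9 - per_gpu_gb  # at --gpu-memory-utilization 0.9

print(f"{weights_gb:.1f} GB weights, {per_gpu_gb:.1f} GB per GPU, "
      f"~{headroom_gb:.1f} GB per GPU left for KV cache")
```

This ignores activations and runtime overhead, so treat it as a rough lower bound, but it explains why a 64 K context fits without trouble.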

2.1 vLLM Docker service (docker‑compose snippet)

services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    shm_size: 16g
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder

Why the last two flags?

  • --enable-auto-tool-choice – lets the model decide when to emit a tool call.
  • --tool-call-parser qwen3_coder – converts Qwen’s XML‑style tool calls into the OpenAI tool‑call format expected by LiteLLM (and ultimately by Claude Code).
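To see roughly what that parser does, here is a toy regex-based sketch. It is a simplification (the real qwen3_coder parser in vLLM handles streaming, typing, and edge cases), but it shows the shape of the conversion from an XML-style call to the OpenAI tool-call format:

```python
import json
import re

def parse_tool_call(text: str) -> dict:
    """Toy sketch of the qwen3_coder tool-call translation.

    Assumes a single call in a simplified XML form:
    <tool_call><function=NAME><parameter=KEY>VALUE</parameter>...</function></tool_call>
    """
    name = re.search(r"<function=([^>\s]+)>", text).group(1)
    params = {
        key: value.strip()
        for key, value in re.findall(
            r"<parameter=([^>\s]+)>(.*?)</parameter>", text, re.S)
    }
    # Emit the OpenAI-style structure LiteLLM (and Claude Code) expect
    return {
        "type": "function",
        "function": {"name": name, "arguments": json.dumps(params)},
    }

raw = """<tool_call>
<function=write_file>
<parameter=path>
app.py
</parameter>
</function>
</tool_call>"""
call = parse_tool_call(raw)
print(call["function"]["name"])  # → write_file
```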

3. LiteLLM Configuration

model_list:
  - model_name: claude-*
    litellm_params:
      model: hosted_vllm/QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      api_base: http://vllm:8000/v1
      api_key: "not-needed"
    model_info:
      max_tokens: 65536
      max_input_tokens: 57344
      max_output_tokens: 8192

litellm_settings:
  drop_params: true          # ignore Anthropic‑only params
  request_timeout: 600
  modify_params: true        # adapt params for OpenAI API

general_settings:
  disable_key_check: true    # no API key needed locally

Settings explained

| Setting | Effect |
|---|---|
| `drop_params: true` | Silently drops Anthropic‑specific fields that have no OpenAI counterpart. |
| `modify_params: true` | Allows LiteLLM to rewrite parameters (e.g., `max_tokens`) to match the target API's expectations. |
| `disable_key_check: true` | Skips API‑key validation; useful when the server runs without authentication. |
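Conceptually, the first two settings amount to a filter-and-rewrite pass over each request. A toy sketch (not LiteLLM's implementation; the field list and clamp value are illustrative) of what happens to incoming parameters:

```python
# Anthropic-only fields with no OpenAI counterpart (illustrative subset)
ANTHROPIC_ONLY = {"top_k", "thinking"}

def adapt_params(params: dict, max_output_tokens: int = 8192) -> dict:
    """Toy sketch of LiteLLM's drop_params + modify_params behaviour."""
    # drop_params: silently discard fields the target API would reject
    adapted = {k: v for k, v in params.items() if k not in ANTHROPIC_ONLY}
    # modify_params: clamp max_tokens to what the local model allows
    if adapted.get("max_tokens", 0) > max_output_tokens:
        adapted["max_tokens"] = max_output_tokens
    return adapted

print(adapt_params({"max_tokens": 32000, "top_k": 40, "temperature": 0.7}))
```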

4. Running Claude Code with the local stack

export ANTHROPIC_BASE_URL="http://localhost:4000"

cd my-project
claude          # launches Claude Code as usual

Result: Identical user experience to the hosted Anthropic API, but all inference happens on‑prem.

Performance & Limits

| Metric | Observation |
|---|---|
| Throughput | ~25‑30 tokens/s on dual MI60, ~175 ms time‑to‑first‑token |
| Context window | Capped at 64 K tokens (Claude Opus can go to 200 K) |
| Model capability | Qwen3‑Coder excels at coding; Claude has broader general knowledge and instruction following. |
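Those numbers translate directly into wall-clock time per agentic turn. A quick estimate (my own arithmetic, using the mid-range of the figures above) also hints at why narrowly scoped prompts suit this setup:

```python
tokens_per_s = 27.5  # mid-range of the observed 25-30 tok/s
ttft_s = 0.175       # observed time to first token

def turn_seconds(output_tokens: int) -> float:
    """Rough wall-clock time for one generation turn."""
    return ttft_s + output_tokens / tokens_per_s

# A focused tool call (~200 tokens) vs a full-file rewrite (~2,000 tokens)
print(f"{turn_seconds(200):.1f} s vs {turn_seconds(2000):.1f} s per turn")
```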

Upsides

  • Zero API cost
  • Full data sovereignty (code never leaves your network)
  • Works in air‑gapped environments

5. End‑to‑End Test: Building a Flask Todo App

export ANTHROPIC_BASE_URL="http://localhost:4000"

cd /tmp && mkdir flask-test && cd flask-test
claude --dangerously-skip-permissions -p \
  "Build a Flask todo app with SQLite persistence, \
   modern UI with gradients and animations, \
   mobile responsive design, and full CRUD operations."

Generated project structure

flask_todo_app/
├── app.py              # Flask routes + SQLite setup
├── requirements.txt    # Dependencies
├── run_app.sh          # Launch script
├── static/
│   ├── css/
│   │   └── style.css   # Gradients, animations, hover effects
│   └── js/
│       └── script.js   # Client‑side interactions
└── templates/
    └── index.html      # Jinja2 template (responsive layout)

Sample app.py

from flask import Flask, render_template, request, redirect, url_for
import sqlite3

app = Flask(__name__)

def init_db():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('''CREATE TABLE IF NOT EXISTS todos
                 (id INTEGER PRIMARY KEY AUTOINCREMENT,
                  task TEXT NOT NULL,
                  completed BOOLEAN DEFAULT FALSE)''')
    conn.commit()
    conn.close()

init_db()

@app.route('/')
def index():
    conn = sqlite3.connect('todos.db')
    c = conn.cursor()
    c.execute('SELECT id, task, completed FROM todos ORDER BY id DESC')
    todos = c.fetchall()
    conn.close()
    return render_template('index.html', todos=todos)

# Add/toggle/delete routes omitted here for brevity
if __name__ == '__main__':
    # Required so `python app.py` starts the dev server on http://localhost:5000
    app.run(debug=True)

Sample CSS (static/css/style.css)

body {
    font-family: 'Poppins', sans-serif;
    background: linear-gradient(135deg, #667eea, #764ba2);
    min-height: 100vh;
    padding: 20px;
}

.container {
    max-width: 800px;
    margin: 0 auto;
}

.header {
    text-align: center;
    padding: 40px 0;
    color: white;
    text-shadow: 0 2px 4px rgba(0,0,0,0.1);
}

Running the app

cd flask_todo_app
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
python app.py   # starts Flask on http://localhost:5000

You can now add, toggle, and delete tasks; the SQLite DB persists across restarts.

6. TL;DR

  1. Set ANTHROPIC_BASE_URL to point at a LiteLLM proxy (http://localhost:4000).
  2. Run vLLM (Docker) with the Qwen3‑Coder model and the --enable-auto-tool-choice + --tool-call-parser qwen3_coder flags.
  3. Configure LiteLLM to map any claude-* model name to the local vLLM endpoint.
  4. Launch Claude Code as usual – it now talks to your on‑prem GPU, giving you zero‑cost, sovereign inference while preserving the full Claude Code workflow (including tool calls).

Enjoy secure, high‑performance coding assistance without ever sending your proprietary code off‑site!

Generation Overview

  • The generation took about five minutes across multiple agentic iterations.
  • Each file is a separate tool call: the model generates, Claude Code executes, the result returns, and the model plans the next step.
  • A 91 % prefix‑cache‑hit rate shows vLLM efficiently reusing context across the multi‑turn loop.

This confirms the agentic workflow functions correctly. The model reads the prompt, plans a file structure, emits tool calls to create directories and write files, and produces a functional application. All inference happens locally on the MI60s—no code leaves my network.

Limitations & Future Work

  • Scale – Not tested on larger codebases. A small Flask app is one thing; a multi‑thousand‑line refactor is another.
  • Context window – The 64 K token limit will eventually become a constraint, and the model may struggle with complex architectural decisions that the real Claude handles gracefully.
  • Current suitability – Works well for focused, scoped tasks.

Claude Code Compatibility Checklist

| Requirement | Details |
|---|---|
| Strong tool use | The model must emit structured tool calls reliably |
| Code focus | Qwen3‑Coder works well; DeepSeek Coder and CodeLlama variants should also be viable |
| Sufficient context | I used 64 K; smaller windows may work but are untested |
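The first requirement is worth probing directly before committing to a model. This hedged sketch (the function names and sample payload are illustrative, not from the article) validates that a chat-completion response message carries a well-formed OpenAI-style tool call, which is the shape Claude Code depends on:

```python
import json

def is_valid_tool_call(message: dict, allowed: set) -> bool:
    """Check that a response message carries a usable structured tool call:
    a known function name plus arguments that parse as a JSON object."""
    for call in message.get("tool_calls") or []:
        fn = call.get("function", {})
        if fn.get("name") not in allowed:
            return False
        try:
            if not isinstance(json.loads(fn.get("arguments", "")), dict):
                return False
        except json.JSONDecodeError:
            return False
    return bool(message.get("tool_calls"))

# A response message as a local OpenAI-compatible server might return it
msg = {
    "role": "assistant",
    "tool_calls": [{
        "type": "function",
        "function": {"name": "write_file",
                     "arguments": '{"path": "app.py", "content": "..."}'},
    }],
}
print(is_valid_tool_call(msg, {"write_file", "read_file"}))  # → True
```

Run a check like this over a handful of real responses from the endpoint; a model that frequently emits malformed or free-text "tool calls" will stall Claude Code's agentic loop.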

Observations

  • Qwen3‑Coder‑30B‑A3B handles straightforward coding tasks well.
  • For complex refactoring or architectural decisions, the real Claude API remains the better choice.

Hardware Tips

  • If you don’t have 64 GB of VRAM, smaller models like Qwen2.5‑Coder‑7B or Qwen3‑8B should fit on a single 16 GB or 24 GB card.
  • I haven’t tested these configurations, so I can’t comment on their context limits or how well they handle Claude Code’s agentic workflows.

Workflow Advice

  • Instead of broad “refactor this module” prompts, break work into tighter, more focused requests.
  • More prompts of narrower scope play to a smaller model’s strengths.

Full Docker‑Compose Configuration

services:
  vllm:
    image: nalanzeyu/vllm-gfx906:v0.11.2-rocm6.3
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    devices:
      - /dev/kfd:/dev/kfd
      - /dev/dri/card1:/dev/dri/card1
      - /dev/dri/card2:/dev/dri/card2
      - /dev/dri/renderD128:/dev/dri/renderD128
      - /dev/dri/renderD129:/dev/dri/renderD129
    group_add:
      - "44"
      - "992"
    shm_size: 16g
    volumes:
      - /mnt/cache/huggingface:/root/.cache/huggingface:rw
    environment:
      - HIP_VISIBLE_DEVICES=0,1
    command:
      - python
      - -m
      - vllm.entrypoints.openai.api_server
      - --model
      - QuantTrio/Qwen3-Coder-30B-A3B-Instruct-AWQ
      - --tensor-parallel-size
      - "2"
      - --max-model-len
      - "65536"
      - --gpu-memory-utilization
      - "0.9"
      - --host
      - "0.0.0.0"
      - --port
      - "8000"
      - --enable-auto-tool-choice
      - --tool-call-parser
      - qwen3_coder

  litellm:
    image: litellm/litellm:v1.80.15-stable
    container_name: litellm
    restart: unless-stopped
    ports:
      - "4000:4000"
    volumes:
      - ./litellm-config.yaml:/app/config.yaml:ro
    command:
      - --config
      - /app/config.yaml
      - --port
      - "4000"
      - --host
      - "0.0.0.0"
    depends_on:
      - vllm

Running the Stack

# Start (using nerdctl or Docker)
nerdctl compose -f coder.yaml up -d

From any machine on the network, point Claude Code at Feynman (the GPU workstation) and get local inference.

# When finished, tear it down
nerdctl compose -f coder.yaml down

Final Thoughts

  • This setup won’t replace the Claude API for everyone.
  • If you need maximum capability, Anthropic’s hosted models remain the best option.
  • For those who care about data sovereignty, local inference means proprietary code never leaves the network.
  • There’s also something satisfying about watching your own GPUs light up every time you ask Claude Code a question.