How to Setup a Local Coding Agent on macOS

Published: (June 12, 2026 at 01:34 PM EDT)
8 min read

Source: Hacker News

I’d had my internet fail a few times recently leaving me stranded without a coding agent, and so when I saw the “Gemma 4 now runs 2x faster with MTP” Multi-Token Prediction update for Gemma 4 I decided to have a go at getting it running.

I wanted a local coding agent setup that:

  • was fast enough to actually use on my Mac

  • worked through an OpenAI compatible API (so I could use it in other tools)

  • and preferably could handle screenshots/images when needed, so I can feed it screenshots of what it has made.

And I did! This video is realtime. And shows the agent responding at a perfectly usable speed.

After a bit of testing the final setup I ended up with is:

  • llama.cpp built with Metal on macOS

  • Gemma 4 26B-A4B in GGUF format

  • A Q8 MTP draft model for speculative decoding

  • The Gemma 4 multimodal projector

  • Pi as the terminal coding agent

This was tested on an Apple M1 Max with 64 GB unified memory, running macOS 15.7.7.

The Model

The main model is: gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf.

Link on Huggingface: models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

That file is about 16 GB. With the MTP draft head and multimodal projector the model folder is about 17 GB.

The benchmark prompt was:

Write a compact Python function that parses a unified diff and returns the changed file paths. Then explain two edge cases.

Each benchmark generated about 128 tokens.

Baseline: llama.cpp + Metal

First I ran the main model directly through llama.cpp with Metal acceleration:

repos/llama.cpp/build/bin/llama-cli
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
-ngl 999
-fa on
-c 4096
-n 128

Result:

Setup Prompt tok/s Generation tok/s

Gemma 4 26B-A4B Q4, llama.cpp Metal 298.0 58.2

58 tokens/second is not fast, but is usable, but for coding-agent work you want it to be as fast as possible, especially when the agent is making many tool calls.

Adding the MTP Draft Model

Gemma 4 now has the MTP draft model available:

MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

This can be loaded by llama.cpp as a speculative draft model:

repos/llama.cpp/build/bin/llama-cli
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
—model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf
—spec-type draft-mtp
—spec-draft-n-max 3
-ngl 999
-fa on
-c 4096
-n 128

The first run with MTP came in at 69.2 tokens/second using 4 draft tokens. However, Unsloth’s guide on How to Run MTP Models includes this note:

“We found —spec-draft-n-max 2 is the best starting point however, do not assume 2 is optimal, as performance is hardware-dependent. Try any value from 1 through 6 and use whichever is fastest for your system.”

After sweeping --spec-draft-n-max, the best result was 72.2 tokens/second with 3 draft tokens.

Setup Prompt tok/s Generation tok/s Speedup

Main model only 298.0 58.2 1.00x

Main model + Q8 MTP draft 295.6 72.2 1.24x

The useful part is that prompt processing stayed basically the same, while generation improved by about 24%.

Tuning MTP

I tested --spec-draft-n-max values from 1 to 6.

--spec-draft-n-max Prompt tok/s Generation tok/s

1 295.5 68.4

2 299.1 72.0

3 295.6 72.2

4 297.3 70.7

5 297.9 63.7

6 296.3 61.2

On my M1 Max machine, 3 was the fastest, with 2 close enough that either would be fine. Values above that got slower.

MLX Comparison

I also tested MLX models through mlx-lm, to find out which is the faster way to run the model on a Mac, llama.cpp or mlx.

Runtime Model Generation tok/s

llama.cpp Metal + MTP Unsloth GGUF Q4 + Q8 MTP 72.2

llama.cpp Metal Unsloth GGUF Q4 58.2

MLX-LM Unsloth UD MLX 4-bit 45.8

MLX-LM mlx-community 4-bit 43.9

MLX-LM mlx-community OptiQ 4-bit 38.1

I thought MLX (being optimised for the Mac) would be fastest.

However, for this specific setup, llama.cpp was faster than MLX, and llama.cpp with MTP was clearly the best option.

I guess all the effort and tweaking which has gone into llama.cpp over time means it quite well optimised fr macOS despite being cross platform.

I also tried Gemma 4 MTP through gemma-4-swift-mlx, but the tested 26B 4-bit MLX checkpoints did not match the loader’s expected weight keys, and I already had the previous MLX tests, so moved on rather than redownload new models and try to tweak things to match.

Adding Image Support

For Pi, I also wanted to be able to attach screenshots. The local model entry I setup for it originally declared the model as text-only:

“input”: [“text”]

That meant Pi did not send image tool output through to the model properly.

The llama.cpp server also needs the Gemma 4 multimodal projector in order for the multi-modal part to work (only the 12B is natively multi-modal):

mmproj-BF16.gguf

When loaded with --mmproj, llama.cpp advertises multimodal support, and Pi can send images.

I re-ran the text benchmark with the projector loaded, just to check it didn’t change the speed:

Setup Projector Prompt tok/s Generation tok/s

llama.cpp Metal + MTP none 120.3 71.4

llama.cpp Metal + MTP mmproj-BF16.gguf 297.4 72.2

The final run with the projector did not show a text-generation slowdown.

Now for setup instructions:

Install llama.cpp

Install dependencies:

brew install cmake git tmux python@3.11

Clone and build llama.cpp:

mkdir -p ~/Developer/ML-Models/Gemma4/repos cd ~/Developer/ML-Models/Gemma4

git clone https://github.com/ggml-org/llama.cpp repos/llama.cpp

cd repos/llama.cpp cmake -B build
-DCMAKE_BUILD_TYPE=Release
-DGGML_METAL=ON
-DGGML_ACCELERATE=ON

cmake —build build —config Release -j

The build I tested had:

GGML_METAL=ON GGML_ACCELERATE=ON GGML_BLAS=ON GGML_BLAS_VENDOR=Apple

Download the Model Files

Create a Python environment:

cd ~/Developer/ML-Models/Gemma4 python3.11 -m venv .venv source .venv/bin/activate pip install -U huggingface_hub hf_xet

Download the files:

mkdir -p models/unsloth-gemma-4-26B-A4B-it-GGUF

huggingface-cli download unsloth/gemma-4-26B-A4B-it-GGUF
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
mmproj-BF16.gguf
MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf
—local-dir models/unsloth-gemma-4-26B-A4B-it-GGUF

You should end up with:

models/unsloth-gemma-4-26B-A4B-it-GGUF/ gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf mmproj-BF16.gguf MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf

Start the Local Server

This is the final server command:

repos/llama.cpp/build/bin/llama-server
-m models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
—model-draft models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf
—mmproj models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf
—spec-type draft-mtp
—spec-draft-n-max 3
-ngl 999
-fa on
-c 65536
—parallel 1
—host 127.0.0.1
—port 8080

The OpenAI-compatible endpoint is:

http://127.0.0.1:8080/v1

I used a small start_server.sh wrapper so it runs inside tmux:

#!/usr/bin/env bash set -euo pipefail

ROOT_DIR=”$(cd ”$(dirname ”${BASH_SOURCE[0]}”)” && pwd)” SESSION_NAME=”${SESSION_NAME:-gemma4-server}” HOST=”${HOST:-127.0.0.1}” PORT=”${PORT:-8080}” CTX_SIZE=”${CTX_SIZE:-65536}” PARALLEL=”${PARALLEL:-1}”

LLAMA_SERVER=“$ROOT_DIR/repos/llama.cpp/build/bin/llama-server” MODEL=“$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf” DRAFT_MODEL=“$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/MTP/gemma-4-26B-A4B-it-Q8_0-MTP.gguf” MMPROJ=“$ROOT_DIR/models/unsloth-gemma-4-26B-A4B-it-GGUF/mmproj-BF16.gguf” LOG_FILE=“$ROOT_DIR/logs/llama-server-mtp.log”

mkdir -p “$ROOT_DIR/logs”

tmux new-session -d -s “$SESSION_NAME” -c “$ROOT_DIR”
“$LLAMA_SERVER
-m ‘$MODEL’
—model-draft ‘$DRAFT_MODEL’
—mmproj ‘$MMPROJ’
—spec-type draft-mtp
—spec-draft-n-max 3
-ngl 999
-fa on
-c ‘$CTX_SIZE’
—parallel ‘$PARALLEL’
—host ‘$HOST’
—port ‘$PORT’
2>&1 | tee -a ‘$LOG_FILE’”

Start it:

chmod +x start_server.sh ./start_server.sh

Check that the server is running:

curl http://127.0.0.1:8080/v1/models

Configure Pi

Pi reads model providers from:

~/.pi/agent/models.json

Add a local provider:

{ “providers”: { “gemma4-local”: { “name”: “Gemma 4 Local”, “baseUrl”: “http://127.0.0.1:8080/v1”, “api”: “openai-completions”, “apiKey”: “local”, “authHeader”: false, “compat”: { “supportsDeveloperRole”: false, “supportsReasoningEffort”: false }, “models”: [ { “id”: “gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf”, “name”: “Gemma 4 26B-A4B Q4 + MTP”, “reasoning”: false, “input”: [“text”, “image”], “contextWindow”: 65536, “maxTokens”: 8192, “cost”: { “input”: 0, “output”: 0, “cacheRead”: 0, “cacheWrite”: 0 } } ] } } }

The important pieces are:

  • baseUrl points to the llama.cpp OpenAI-compatible server.

  • api is openai-completions.

  • authHeader is false, because this is a local server.

  • input includes both text and image, otherwise Pi treats it as text-only.

Optionally make it the default in:

~/.pi/agent/settings.json

{ “defaultProvider”: “gemma4-local”, “defaultModel”: “gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf”, “defaultThinkingLevel”: “minimal” }

Then check Pi can see it:

pi —offline —list-models gemma

Expected:

provider model context max-out thinking images gemma4-local gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf 65.5K 8.2K no yes

Run Pi using the local model:

pi —provider gemma4-local —model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

Or use non-interactive mode:

pi -p —provider gemma4-local —model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
“Explain what this repository does”

For screenshots:

pi -p @“/path/to/screenshot.png” “Describe this image and point out anything relevant to the UI”

Final Setup

The final local coding-agent stack was:

Layer Choice

Inference runtime llama.cpp

macOS acceleration Metal + Accelerate

Main model gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf

Draft model gemma-4-26B-A4B-it-Q8_0-MTP.gguf

MTP setting --spec-draft-n-max 3

Multimodal projector mmproj-BF16.gguf

Server llama-server on 127.0.0.1:8080

API OpenAI-compatible /v1

Coding agent Pi

Pi model input ["text", "image"]

The main conclusion was that the MTP draft model is worth using. On this machine it took Gemma 4 from 58.2 tokens/second to 72.2 tokens/second, while keeping the setup simple enough to run as a local OpenAI-compatible server.

P.S: Some suggested using Qwen3.6 35B-A3B instead of Gemma 4 26B-A4B. According to the benchmarks I can find, Qwen is a much better coding agent than Gemma 4.

However, it is also slower. Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf + unsloth-Qwen3.6-35B-A3B-MTP-GGUF + mmproj-BF16.gguf results in 55 tk/s, instead of 72 tk/s. Which is quite significant when you are sitting waiting for it.

Download the models:

mkdir -p models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF

huggingface-cli download unsloth/Qwen3.6-35B-A3B-MTP-GGUF
Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
mmproj-BF16.gguf
—local-dir models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF

Start the server:

LLAMA_SERVER=/Users/kylehowells/Developer/ML-Models/Gemma4/repos/llama.cpp/build/bin/llama-server

$LLAMA_SERVER
-m models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
—mmproj models/unsloth-Qwen3.6-35B-A3B-MTP-GGUF/mmproj-BF16.gguf
—spec-type draft-mtp
—spec-draft-n-max 3
-ngl 999
-fa on
-c 65536
—parallel 1
—host 127.0.0.1
—port 8081

Pi Config:

{ “providers”: { “qwen36-local”: { “name”: “Qwen3.6 Local”, “baseUrl”: “http://127.0.0.1:8081/v1”, “api”: “openai-completions”, “apiKey”: “local”, “authHeader”: false, “compat”: { “supportsDeveloperRole”: false, “supportsReasoningEffort”: false }, “models”: [ { “id”: “Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf”, “name”: “Qwen3.6 35B-A3B Q4 + MTP”, “reasoning”: true, “input”: [“text”, “image”], “contextWindow”: 65536, “maxTokens”: 8192, “cost”: { “input”: 0, “output”: 0, “cacheRead”: 0, “cacheWrite”: 0 } } ] } } }

References:

0 views
Back to Blog

Related posts

Read more »

Chaosnet (1981)

1 Introduction ¶Introduction Chaosnet is a local network, that is, a system for communication among a group of computers located within one or two kilometers o...

Rome Fell and Nobody Noticed

When I first began learning about the Roman Empire in middle school, I was most interested in what everyone else seems to be interested in — the time of Caesar...