Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts
Source: Hacker News
Problem I was trying to solve
Running a 32B model normally requires ~64 GB of VRAM, which most developers don't have. Quantization helps with memory, but cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts, which kills serverless and autoscaling use cases.
What ZSE does differently
- 32B model fits in 19.3 GB VRAM (70% reduction vs FP16) – runs on a single A100-40GB
- 7B model fits in 5.2 GB VRAM (63% reduction) – runs on consumer GPUs
- Native .zse pre-quantized format with memory-mapped weights:
  - 3.9 s cold start for 7B
  - 21.4 s cold start for 32B
  - vs. 45 s / 120 s with bitsandbytes, ~30 s for vLLM
- All benchmarks verified on Modal A100-80GB (Feb 2026)
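A back-of-envelope check on those memory numbers (my arithmetic, not from the ZSE docs): a dense model's weight footprint is roughly parameter count × bits per parameter, so FP16 needs ~2 bytes/param, and the reported 19.3 GB for 32B works out to a bit under 5 bits/param, consistent with ~4-bit weights plus quantization scales and a few higher-precision layers.

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate VRAM needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# FP16 baseline for a 32B model: 64.0 GB, matching the post's "~64 GB" figure.
fp16_32b = weight_footprint_gb(32, 16)

# The reported 19.3 GB for 32B implies roughly this many bits per parameter.
implied_bits = 19.3 * 8 / 32  # ~4.8 bits/param

print(fp16_32b, implied_bits)
```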
Features
- OpenAI‑compatible API server (drop‑in replacement)
- Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
- Web dashboard with real-time GPU monitoring
- Continuous batching (3.45× throughput)
- GGUF support via llama.cpp
- CPU fallback – works without a GPU
- Rate limiting, audit logging, API‑key authentication
Installation
pip install zllm-zse
Running a model
zse serve Qwen/Qwen2.5-7B-Instruct
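Since the server is OpenAI-compatible, any OpenAI client should be able to talk to it. A minimal stdlib-only sketch; the base URL, port, and API key here are my assumptions, not ZSE defaults, so substitute whatever `zse serve` prints on startup:

```python
import json
import urllib.request

# Assumed endpoint; replace with the address your `zse serve` instance reports.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about cold starts."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer YOUR_KEY"},  # placeholder key
)

# Uncomment once the server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```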
Fast cold starts (one‑time conversion)
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
zse serve qwen-7b.zse # 3.9 s every time
How the cold‑start improvement works
The .zse format stores pre‑quantized weights as memory‑mapped safetensors.
- No quantization step at load time
- No weight conversion, just mmap + GPU transfer
On NVMe SSDs this yields under 4 seconds for a 7B model; on spinning HDDs it will be slower.
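The load path above can be sketched with the stdlib: mmap gives a view of the file without reading it eagerly, and only the pages you actually touch (e.g. while copying a tensor to the GPU) are faulted in from disk. This is an illustration of the mechanism, not ZSE's actual code:

```python
import mmap
import os
import tempfile

# Stand-in for a pre-quantized .zse weight file on disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)  # 1 MiB of fake quantized weights

with open(path, "rb") as f:
    # No parsing or dequantization at open time: the OS maps the file
    # and loads pages lazily on first access.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_block = mm[:4096]  # only now is this page read from disk
    mm.close()

os.remove(path)
print(len(first_block))  # 4096
```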
License
All code is real – no mock implementations. Built at Zyora Labs. Licensed under Apache 2.0.
Comments: (Points: 9)