Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts
Source: Hacker News
Problem I was trying to solve
Running a 32B model normally requires ~64 GB of VRAM, which most developers don't have. Quantization helps with memory, but cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts, which kills serverless and autoscaling use cases.
What ZSE does differently
- 32B model fits in 19.3 GB VRAM (70% reduction vs FP16) – runs on a single A100-40GB
- 7B model fits in 5.2 GB VRAM (63% reduction) – runs on consumer GPUs
- Native .zse pre-quantized format with memory-mapped weights:
  - 3.9 s cold start for 7B
  - 21.4 s cold start for 32B
  - vs. 45 s / 120 s with bitsandbytes, ~30 s for vLLM
- All benchmarks verified on Modal A100-80GB (Feb 2026)
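A back-of-envelope check on those memory numbers (my arithmetic, not from the ZSE docs): a dense model's weight footprint is roughly parameter count × bits per parameter, so FP16 needs ~2 bytes/param, and the reported 19.3 GB for 32B works out to a bit under 5 bits/param, consistent with ~4-bit weights plus quantization scales and a few higher-precision layers.

```python
def weight_footprint_gb(params_billions: float, bits_per_param: float) -> float:
    """Approximate VRAM needed just for the weights, in GB (1 GB = 1e9 bytes)."""
    return params_billions * 1e9 * bits_per_param / 8 / 1e9

# FP16 baseline for a 32B model: 64.0 GB, matching the post's "~64 GB" figure.
fp16_32b = weight_footprint_gb(32, 16)

# The reported 19.3 GB for 32B implies roughly this many bits per parameter.
implied_bits = 19.3 * 8 / 32  # ~4.8 bits/param

print(fp16_32b, implied_bits)
```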
Features
- OpenAI‑compatible API server (drop‑in replacement)
- Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
- Web dashboard with real-time GPU monitoring
- Continuous batching (3.45× throughput)
- GGUF support via llama.cpp
- CPU fallback – works without a GPU
- Rate limiting, audit logging, API‑key authentication
Installation
pip install zllm-zse
Running a model
zse serve Qwen/Qwen2.5-7B-Instruct
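Since the server is OpenAI-compatible, any OpenAI client should be able to talk to it. A minimal stdlib-only sketch; the base URL, port, and API key here are my assumptions, not ZSE defaults, so substitute whatever `zse serve` prints on startup:

```python
import json
import urllib.request

# Assumed endpoint; replace with the address your `zse serve` instance reports.
BASE_URL = "http://localhost:8000/v1"

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about cold starts."}],
    "max_tokens": 64,
}

req = urllib.request.Request(
    f"{BASE_URL}/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json",
             "Authorization": "Bearer YOUR_KEY"},  # placeholder key
)

# Uncomment once the server is actually running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```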
Fast cold starts (one‑time conversion)
zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
zse serve qwen-7b.zse # 3.9 s every time
How the cold‑start improvement works
The .zse format stores pre‑quantized weights as memory‑mapped safetensors.
- No quantization step at load time
- No weight conversion, just mmap + GPU transfer
On NVMe SSDs this yields under 4 seconds for a 7B model; on spinning HDDs it will be slower.
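The load path above can be sketched with the stdlib: mmap gives a view of the file without reading it eagerly, and only the pages you actually touch (e.g. while copying a tensor to the GPU) are faulted in from disk. This is an illustration of the mechanism, not ZSE's actual code:

```python
import mmap
import os
import tempfile

# Stand-in for a pre-quantized .zse weight file on disk.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)  # 1 MiB of fake quantized weights

with open(path, "rb") as f:
    # No parsing or dequantization at open time: the OS maps the file
    # and loads pages lazily on first access.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    first_block = mm[:4096]  # only now is this page read from disk
    mm.close()

os.remove(path)
print(len(first_block))  # 4096
```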
License
All code is real – no mock implementations. Built at Zyora Labs. Licensed under Apache 2.0.
Comments: (Points: 9)