Show HN: ZSE – Open-source LLM inference engine with 3.9s cold starts

Published: February 25, 2026 at 08:15 PM EST

Source: Hacker News

Problem I was trying to solve

Running a 32 B model in FP16 requires ~64 GB of VRAM, which most developers don’t have. Quantization helps with memory, but cold starts with bitsandbytes NF4 take 2+ minutes on first load and 45–120 seconds on warm restarts, which kills serverless and autoscaling use cases.
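As a back-of-envelope check on those numbers (a sketch of the arithmetic, not ZSE code):

```python
# FP16 stores 2 bytes per parameter; 4-bit NF4 stores ~0.5 bytes per
# parameter, before quantization scales, activations, and KV-cache overhead.

def fp16_weight_gb(params_b: float) -> float:
    """Weight memory in GB for FP16 (2 bytes/param)."""
    return params_b * 2.0

def nf4_weight_gb(params_b: float) -> float:
    """Weight memory in GB for 4-bit NF4 (0.5 bytes/param)."""
    return params_b * 0.5

# A 32 B model: 64 GB in FP16, ~16 GB of raw NF4 weights.
print(fp16_weight_gb(32))  # 64.0
print(nf4_weight_gb(32))   # 16.0
```

The gap between the ~16 GB of raw NF4 weights and the 19.3 GB total reported above would be runtime overhead (scales, activations, KV cache).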

What ZSE does differently

  • 32 B model fits in 19.3 GB VRAM (70 % reduction vs FP16) – runs on a single A100‑40GB
  • 7 B model fits in 5.2 GB VRAM (63 % reduction) – runs on consumer GPUs
  • Native .zse pre‑quantized format with memory‑mapped weights:
    • 3.9 s cold start for 7 B
    • 21.4 s cold start for 32 B
    • vs. 45 s / 120 s with bitsandbytes, ~30 s for vLLM
  • All benchmarks verified on Modal A100‑80GB (Feb 2026)

Features

  • OpenAI‑compatible API server (drop‑in replacement)
  • Interactive CLI (zse serve, zse chat, zse convert, zse hardware)
  • Web dashboard with real‑time GPU monitoring
  • Continuous batching (3.45× throughput)
  • GGUF support via llama.cpp
  • CPU fallback – works without a GPU
  • Rate limiting, audit logging, API‑key authentication
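Since the server is OpenAI-compatible, any standard chat-completions client should work. A minimal stdlib-only sketch, assuming the server listens on `http://localhost:8000/v1` (check the `zse serve` startup output for the real address and auth settings):

```python
import json
import urllib.request

def build_payload(prompt: str, model: str = "Qwen/Qwen2.5-7B-Instruct") -> dict:
    """Standard chat-completions request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer sk-local",  # only if API-key auth is enabled
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

# Example (requires a running server):
# print(chat("Explain mmap in one sentence."))
```

The `sk-local` token is a placeholder; substitute whatever key the server is configured with.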

Installation

pip install zllm-zse

Running a model

zse serve Qwen/Qwen2.5-7B-Instruct

Fast cold starts (one‑time conversion)

zse convert Qwen/Qwen2.5-Coder-7B-Instruct -o qwen-7b.zse
zse serve qwen-7b.zse   # 3.9 s every time

How the cold‑start improvement works

The .zse format stores pre‑quantized weights as memory‑mapped safetensors.

  • No quantization step at load time
  • No weight conversion, just mmap + GPU transfer

On NVMe SSDs this yields under 4 seconds for a 7 B model; on spinning HDDs it will be slower.
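The mechanism can be illustrated with plain `mmap` (a generic sketch of the technique, not the actual `.zse` loader, which additionally transfers the mapped tensors to the GPU):

```python
# Why a pre-quantized, memory-mapped file loads fast: the OS pages bytes in
# on demand instead of parsing or converting anything at startup.
import mmap
import os
import struct
import tempfile

# Write a fake "weight" file: 1024 float32 values (0.0, 1.0, ..., 1023.0).
path = os.path.join(tempfile.mkdtemp(), "weights.bin")
with open(path, "wb") as f:
    f.write(struct.pack("<1024f", *range(1024)))

# Loading is just mapping the file; no deserialization or quantization pass.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # Read one value in place, without copying the whole file.
    (w_100,) = struct.unpack_from("<f", mm, 100 * 4)
    mm.close()

print(w_100)  # 100.0
```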

License

All code is real, with no mock implementations. Built at Zyora Labs and licensed under Apache 2.0.


Points: 9

