I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards
Source: Dev.to
Overview
I bought four NVIDIA CMP 100‑210 cards on the second‑hand market for about $130 each.
These are ex‑mining cards based on the Volta GV100 die – the same silicon as the V100 – with 16 GB of HBM2 each. On paper, four of them give me 64 GB of HBM2 for the price of a single used RTX 3090.
In practice, NVIDIA crippled them in hardware.
The Throttle
- Tensor‑core throttling: 64× slowdown.
- HMMA latency stretched from 8 cycles → 512 cycles.
- cuBLAS WMMA caps at ≈ 5 TFLOPS per card.
- PCIe: locked to Gen1 ×1.
- No P2P, no NVLink.
- CUPTI: blocked → NVIDIA’s profiler unusable.
The throttle is enforced by an e‑fuse + PMU boot‑ROM double‑lock on the die – a hardware‑level lock, not a firmware switch. There is no software unlock (I tried).
Result: anything that routes through the tensor cores (cuBLAS included) runs at 1/64 speed or fails outright. This includes:
- vLLM
- llama.cpp’s default cuBLAS path
- FlashAttention
- bitsandbytes
- PyTorch’s default matmul
The standard LLM inference stack is therefore unusable on this hardware.
The Work‑around
Only the tensor cores are throttled. Two other execution paths on the same chip run at full speed:
| Path | Description | Performance | Throttled? |
|---|---|---|---|
| DP4A | 4‑way packed int8 dot‑product | ≈ 17 TFLOPS | No |
| HFMA2 | 2‑way packed fp16 fused multiply‑add | ≈ 24 TFLOPS | No |
Neither matches a healthy V100’s tensor cores, but both are far above the 5 TFLOPS cuBLAS WMMA ceiling. By routing all inference through these two paths we can recover ≈ 50 % of an unthrottled V100’s performance – still vastly better than nothing.
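To make the DP4A path concrete, here is a minimal CUDA sketch of an int8 GEMV built on the __dp4a intrinsic. This is an illustration of the technique, not qengine’s actual kernel; the names, launch shape, and single per-row scale are assumptions (real Q8_0 kernels fold in per-block scales).

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// One block per output row: each thread accumulates packs of four int8 values
// with __dp4a (sm_61+), then a shared-memory reduction combines the partials.
// Assumes cols % 4 == 0, 4-byte-aligned rows, and 256 threads per block.
__global__ void int8_gemv_dp4a(const int8_t* __restrict__ W,  // [rows, cols] row-major weights
                               const int8_t* __restrict__ x,  // [cols] activations
                               float* __restrict__ y,         // [rows] output
                               float scale,                   // combined dequant scale (illustrative)
                               int rows, int cols)
{
    int row = blockIdx.x;
    if (row >= rows) return;

    const int* Wp = reinterpret_cast<const int*>(W + (size_t)row * cols);
    const int* xp = reinterpret_cast<const int*>(x);
    int packs = cols / 4;                       // four int8 values per 32-bit word

    int acc = 0;
    for (int p = threadIdx.x; p < packs; p += blockDim.x)
        acc = __dp4a(Wp[p], xp[p], acc);        // 4 int8 multiply-accumulates per instruction

    __shared__ int smem[256];
    smem[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = scale * smem[0];
}
```

Nothing here touches HMMA/WMMA, so the e‑fuse throttle never comes into play; the same reasoning applies to the fp16 HFMA2 (__hfma2) path.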
Introducing qengine
qengine is a from‑scratch CUDA inference engine for Qwen 3.5 / Qwen 3.6 hybrid models. (Note: Qwen 3.5/3.6 use a dense GDN + Attention architecture, not pure transformers, so the kernels differ.)
Features
- Hand‑written Q8_0 GEMM tile path for prefill – all DP4A.
- Fused FlashAttention kernel (score + softmax + value) – single‑pass.
- Split‑K FlashAttention for long contexts.
- 3‑bit Walsh‑Hadamard + Lloyd‑Max KV cache → 27 B model fits 256 K context on three 16 GB cards (the quantization idea is sketched below).
- OpenAI‑compatible HTTP API with streaming, tool calls, vision, continuous batching, and per‑slot prefix caching.
All kernels are written for sm_70 (CMP constraints) and are not forks of existing libraries.
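For the 3‑bit KV cache mentioned above, here is a small host‑side sketch of the underlying idea: rotate each head vector with a fast Walsh‑Hadamard transform so its components become more evenly distributed, then snap every component to the nearest entry of an 8‑level (3‑bit) Lloyd‑Max codebook. The per‑vector scale and the codebook handling are assumptions for illustration, not qengine’s actual scheme.

```cuda
#include <cmath>
#include <cstdint>
#include <vector>

// In-place fast Walsh-Hadamard transform; n must be a power of two.
void fwht(float* v, int n) {
    for (int h = 1; h < n; h <<= 1)
        for (int i = 0; i < n; i += h << 1)
            for (int j = i; j < i + h; ++j) {
                float a = v[j], b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
    float norm = 1.0f / std::sqrt((float)n);   // orthonormal scaling
    for (int i = 0; i < n; ++i) v[i] *= norm;
}

// Quantize one rotated head vector to 3-bit indices plus a per-vector scale.
// `codebook` is an 8-entry Lloyd-Max codebook (assumed pre-trained on the
// rotated value distribution); indices are kept one per byte for readability.
std::vector<uint8_t> quantize3bit(const float* v, int n,
                                  const float codebook[8], float* scale_out) {
    float amax = 1e-8f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(v[i]));
    *scale_out = amax;                          // per-vector scale (assumption)

    std::vector<uint8_t> idx(n);
    for (int i = 0; i < n; ++i) {
        float x = v[i] / amax;                  // normalize to [-1, 1]
        int best = 0;
        for (int c = 1; c < 8; ++c)
            if (std::fabs(x - codebook[c]) < std::fabs(x - codebook[best])) best = c;
        idx[i] = (uint8_t)best;                 // 3 bits per component before packing
    }
    return idx;
}
```

In a real cache the 3‑bit indices would be packed tightly and the codebook fitted with Lloyd‑Max (1‑D k‑means) offline; the sketch only shows the rotate‑then‑snap structure.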
Honest Benchmarks
Comparison: qengine vs. llama.cpp (build 8462, -fa 1, same Q8_0 GGUFs, same hardware). Higher numbers = better.
Prefill
| Model / Prefill Length | Tokens / s (qengine) | Tokens / s (llama.cpp) | Speed‑up |
|---|---|---|---|
| Qwen 3.5‑9B – 297 tokens | 594 | 199 | 2.99× |
| Qwen 3.5‑9B – 1.16K tokens | 683 | 316 | 2.16× |
| Qwen 3.5‑9B – 4.62K tokens | 584 | 361 | 1.62× |
| Qwen 3.5‑9B – 18K tokens | 393 | 324 | 1.22× |
qengine leads at all four lengths, though the advantage narrows from ~3× at short prefills to 1.22× at 18 K.
Generation Throughput
| Model | Tokens / s (qengine) | Tokens / s (llama.cpp) | Gain |
|---|---|---|---|
| 9 B | ≈ 70 | 46.6 | +48 % |
| 27 B | 26.3 | 17.7 | +51 % |
Weak point: 9 B dual‑GPU at 18 K still trails llama.cpp (~0.48×). The reason is that llama.cpp overlaps activation transfer with compute, while qengine must transfer sequentially through pinned host memory (no P2P). Single‑GPU 9 B is already faster than either dual‑GPU run, so the gap is mostly theoretical.
Implementation Challenges
1. Multi‑GPU without P2P
- CMP cards lack peer‑to‑peer and NVLink.
- Hidden state must bounce through pinned host memory between GPUs.
- Solution: a pinned‑host buffer per cross‑GPU edge + a worker thread per GPU. Works, but is sequential.
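A minimal sketch of that hand‑off, assuming the pinned buffer was allocated once with cudaMallocHost (function and stream names are illustrative, not qengine’s actual code):

```cuda
#include <cuda_runtime.h>

// No-P2P hand-off: the hidden state bounces through a pinned host buffer.
// Each stream is assumed to have been created on its own device.
void hop_hidden_state(const float* d_src, int src_dev, cudaStream_t src_stream,
                      float*       d_dst, int dst_dev, cudaStream_t dst_stream,
                      float* h_pinned,     // allocated once with cudaMallocHost
                      size_t bytes)
{
    // GPU A -> pinned host memory
    cudaSetDevice(src_dev);
    cudaMemcpyAsync(h_pinned, d_src, bytes, cudaMemcpyDeviceToHost, src_stream);
    cudaStreamSynchronize(src_stream);   // no P2P: must land in host RAM first

    // pinned host memory -> GPU B
    cudaSetDevice(dst_dev);
    cudaMemcpyAsync(d_dst, h_pinned, bytes, cudaMemcpyHostToDevice, dst_stream);
    cudaStreamSynchronize(dst_stream);   // sequential by construction
}
```

With P2P this would collapse to a single cudaMemcpyPeerAsync; without it, the two synchronizations are exactly what makes the transfer sequential.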
2. Numerical Drift Killing Korean Output
- Qwen 3.5‑9B’s Korean circuits are weak; small fp16 reorder noise can flip argmax decisions, producing garbled Korean.
- After a chunked‑prefill kernel optimisation that passed English tests, Korean broke.
- Fix: every kernel that touches the attention‑reduction order now runs a Korean argmax‑stability check before shipping.
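A sketch of what such a check can look like: greedy‑decode a fixed Korean prompt with a reference build and the candidate kernel build, and require the argmax token streams to match. The forward‑pass hook here is a hypothetical stand‑in, not qengine’s real test harness.

```cuda
#include <algorithm>
#include <vector>

// Hypothetical stand-in for "run the model and return next-token logits".
using ForwardFn = std::vector<float> (*)(const std::vector<int>& tokens);

// Greedy-decode `steps` tokens from a fixed (Korean) prompt with both builds
// and fail as soon as a single argmax decision flips.
bool argmax_stable(ForwardFn reference, ForwardFn candidate,
                   std::vector<int> tokens, int steps)
{
    for (int s = 0; s < steps; ++s) {
        std::vector<float> ref = reference(tokens);
        std::vector<float> cnd = candidate(tokens);
        int ref_tok = (int)(std::max_element(ref.begin(), ref.end()) - ref.begin());
        int cnd_tok = (int)(std::max_element(cnd.begin(), cnd.end()) - cnd.begin());
        if (ref_tok != cnd_tok) return false;   // one flipped argmax is enough to garble Korean
        tokens.push_back(ref_tok);              // continue the decode along the reference path
    }
    return true;
}
```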
3. Split‑K FlashAttention without Breaking Determinism
- Original 64‑block FA grid under‑utilised SMs at long context (only 64 blocks across 3 × 68 = 204 SMs).
- Added a split‑K variant mapping each (kv_head, t_idx) to N independent blocks, each handling a contiguous tile range.
- Merged partials using the log‑sum‑exp identity:
m_global = max_s m_s
l_global = Σ_s exp(m_s − m_global) · l_s
o_global = Σ_s exp(m_s − m_global) · acc_o_s
- First version stored partial o accumulators as half → drift after ~31 generated tokens (not bit‑exact).
- Storing partials as fp32 reduces drift to fp32‑reordering noise (~1e‑7 per add) → greedy argmax stable across 32+ tokens.
- Result: 18 K prefill went from 270 → 393 t/s (9 B) and 104 → 139 t/s (27 B).
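Spelled out in code, the merge step above looks roughly like this: a host‑side sketch of the math with fp32 partials (not the actual kernel), where S splits each produced m_s, l_s, and an unnormalized acc_o_s:

```cuda
#include <cmath>

// Merge S split-K partials using the log-sum-exp identity quoted above.
// Partials stay in fp32 until this point (the determinism fix described above).
void merge_splits(const float* m_s, const float* l_s, const float* acc_o_s, // [S], [S], [S, head_dim]
                  int S, int head_dim, float* o_out)                        // [head_dim]
{
    // m_global = max_s m_s
    float m_global = -INFINITY;
    for (int s = 0; s < S; ++s) m_global = fmaxf(m_global, m_s[s]);

    // l_global = sum_s exp(m_s - m_global) * l_s
    // o_global = sum_s exp(m_s - m_global) * acc_o_s
    float l_global = 0.0f;
    for (int d = 0; d < head_dim; ++d) o_out[d] = 0.0f;
    for (int s = 0; s < S; ++s) {
        float w = expf(m_s[s] - m_global);
        l_global += w * l_s[s];
        for (int d = 0; d < head_dim; ++d)
            o_out[d] += w * acc_o_s[s * head_dim + d];
    }

    // Final normalization by the merged softmax denominator.
    float inv_l = 1.0f / l_global;
    for (int d = 0; d < head_dim; ++d) o_out[d] *= inv_l;
}
```

The final division by l_global is the usual softmax normalization; what mattered for determinism was keeping m_s, l_s, and acc_o_s in fp32 until this point.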
4. Speculative Decoding (still broken)
- Repo contains DFlash + DDTree code for a future fine‑tuned drafter.
- Pre‑trained drafter (lucebox‑hub/dflash) was trained on stock Qwen 3.5; our distill’s output distribution mismatches → accept rate ≈ 0 % and chains degenerate.
- Marked as broken on purpose in the README.
- MTP K=1 single‑token spec works fine.
Who Should Use qengine?
| Situation | Recommendation |
|---|---|
| RTX 30/40‑series, A100, H100 | Use vLLM or SGLang – they are far more optimized and have extensive test coverage. |
| Ex‑mining cards (CMP 100‑210, ex‑mining V100, P104‑100, etc.) | qengine may be useful. |
| Older Volta workstations (V100 16/32 GB, Titan V, Quadro GV100) | qengine works (targets sm_70). |
| T4 or RTX 20‑series where standard stacks disappoint | qengine could help. |
| GPUs without DP4A (e.g., sm_60) | Not supported. |
| AMD / Apple GPUs | Not supported. |
qengine targets sm_70 specifically. sm_75 should work but isn’t tuned; sm_60 lacks DP4A and therefore cannot run it.
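If you are unsure what a given card supports, a quick device query tells you whether the DP4A path exists at all (__dp4a needs compute capability 6.1 or newer; qengine’s kernels are tuned for 7.0):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// List every visible GPU and whether it has the DP4A instruction (sm_61+).
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        bool has_dp4a = (p.major > 6) || (p.major == 6 && p.minor >= 1);
        printf("GPU %d: %s (sm_%d%d) DP4A: %s\n",
               i, p.name, p.major, p.minor, has_dp4a ? "yes" : "no");
    }
    return 0;
}
```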
Repo
https://github.com/Haru-neo/qengine — Apache 2.0
The benchmarks in this post are reproducible with the bench_curl.sh script in the repo. The 27 B 3‑GPU numbers were measured 2026‑05‑03 on my machine. If you have the hardware and try it, I’d love to know what you see.
Project Details
- Solo project.
- Heavy AI assistance on the CUDA code – I drove the architecture, profiling, and debugging across many sessions; Claude did most of the kernel implementation.
- I’m a Korean high‑school student.
- Slow PR turnaround.