I wrote a custom CUDA inference engine to run Qwen3.5-27B on $130 mining cards
Source: Dev.to
Overview
I bought four NVIDIA CMP 100‑210 cards on the second‑hand market for about $130 each.
These are ex‑mining cards based on the Volta GV100 die – the same silicon as the V100 – with 16 GB of HBM2 each. On paper, four of them give me 64 GB of HBM2 for the price of a single used RTX 3090.
In practice, NVIDIA crippled them in hardware.
The Throttle
- Tensor‑core throttling: 64× slowdown.
- HMMA latency stretched from 8 cycles → 512 cycles.
- cuBLAS WMMA caps at ≈ 5 TFLOPS per card.
- PCIe: locked to Gen1 ×1.
- No P2P, no NVLink.
- CUPTI: blocked → NVIDIA’s profiler unusable.
The throttle is enforced by an e‑fuse + PMU boot‑ROM double‑lock on the die – a hardware‑level lock, not a firmware switch. There is no software unlock (I tried).
Result: anything that routes through the tensor cores (cuBLAS included) runs at 1/64 speed or fails outright. This includes:
- vLLM
- llama.cpp’s default cuBLAS path
- FlashAttention
- bitsandbytes
- PyTorch’s default matmul
The standard LLM inference stack is therefore unusable on this hardware.
The Work‑around
Only the tensor cores are throttled. Two other execution paths on the same chip run at full speed:
| Path | Description | Performance | Throttled? |
|---|---|---|---|
| DP4A | 4‑way packed int8 dot‑product | ≈ 17 TFLOPS | No |
| HFMA2 | 2‑way packed fp16 fused multiply‑add | ≈ 24 TFLOPS | No |
Neither matches a healthy V100’s tensor cores, but both are far above the 5 TFLOPS cuBLAS WMMA ceiling. By routing all inference through these two paths we can recover ≈ 50 % of an unthrottled V100’s performance – still vastly better than nothing.
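To make the DP4A path concrete, here is a minimal CUDA sketch of an int8 GEMV built on the __dp4a intrinsic. This is an illustration of the technique, not qengine’s actual kernel; the names, launch shape, and single per-row scale are assumptions (real Q8_0 kernels fold in per-block scales).

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// One block per output row: each thread accumulates packs of four int8 values
// with __dp4a (sm_61+), then a shared-memory reduction combines the partials.
// Assumes cols % 4 == 0, 4-byte-aligned rows, and 256 threads per block.
__global__ void int8_gemv_dp4a(const int8_t* __restrict__ W,  // [rows, cols] row-major weights
                               const int8_t* __restrict__ x,  // [cols] activations
                               float* __restrict__ y,         // [rows] output
                               float scale,                   // combined dequant scale (illustrative)
                               int rows, int cols)
{
    int row = blockIdx.x;
    if (row >= rows) return;

    const int* Wp = reinterpret_cast<const int*>(W + (size_t)row * cols);
    const int* xp = reinterpret_cast<const int*>(x);
    int packs = cols / 4;                       // four int8 values per 32-bit word

    int acc = 0;
    for (int p = threadIdx.x; p < packs; p += blockDim.x)
        acc = __dp4a(Wp[p], xp[p], acc);        // 4 int8 multiply-accumulates per instruction

    __shared__ int smem[256];
    smem[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) smem[threadIdx.x] += smem[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) y[row] = scale * smem[0];
}
```

Nothing here touches HMMA/WMMA, so the e‑fuse throttle never comes into play; the same reasoning applies to the fp16 HFMA2 (__hfma2) path.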
Introducing qengine
qengine is a from‑scratch CUDA inference engine for Qwen 3.5 / Qwen 3.6 hybrid models. (Note: Qwen 3.5/3.6 use a dense GDN + Attention architecture, not pure transformers, so the kernels differ.)
Features
- Hand‑written Q8_0 GEMM tile path for prefill – all DP4A.
- Fused FlashAttention kernel (score + softmax + value) – single‑pass.
- Split‑K FlashAttention for long contexts.
- 3‑bit Walsh‑Hadamard + Lloyd‑Max KV cache → 27 B model fits 256 K context on three 16 GB cards (the quantization idea is sketched below).
- OpenAI‑compatible HTTP API with streaming, tool calls, vision, continuous batching, and per‑slot prefix caching.
All kernels are written for sm_70 (CMP constraints) and are not forks of existing libraries.
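For the 3‑bit KV cache mentioned above, here is a small host‑side sketch of the underlying idea: rotate each head vector with a fast Walsh‑Hadamard transform so its components become more evenly distributed, then snap every component to the nearest entry of an 8‑level (3‑bit) Lloyd‑Max codebook. The per‑vector scale and the codebook handling are assumptions for illustration, not qengine’s actual scheme.

```cuda
#include <cmath>
#include <cstdint>
#include <vector>

// In-place fast Walsh-Hadamard transform; n must be a power of two.
void fwht(float* v, int n) {
    for (int h = 1; h < n; h <<= 1)
        for (int i = 0; i < n; i += h << 1)
            for (int j = i; j < i + h; ++j) {
                float a = v[j], b = v[j + h];
                v[j]     = a + b;
                v[j + h] = a - b;
            }
    float norm = 1.0f / std::sqrt((float)n);   // orthonormal scaling
    for (int i = 0; i < n; ++i) v[i] *= norm;
}

// Quantize one rotated head vector to 3-bit indices plus a per-vector scale.
// `codebook` is an 8-entry Lloyd-Max codebook (assumed pre-trained on the
// rotated value distribution); indices are kept one per byte for readability.
std::vector<uint8_t> quantize3bit(const float* v, int n,
                                  const float codebook[8], float* scale_out) {
    float amax = 1e-8f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(v[i]));
    *scale_out = amax;                          // per-vector scale (assumption)

    std::vector<uint8_t> idx(n);
    for (int i = 0; i < n; ++i) {
        float x = v[i] / amax;                  // normalize to [-1, 1]
        int best = 0;
        for (int c = 1; c < 8; ++c)
            if (std::fabs(x - codebook[c]) < std::fabs(x - codebook[best])) best = c;
        idx[i] = (uint8_t)best;                 // 3 bits per component before packing
    }
    return idx;
}
```

In a real cache the 3‑bit indices would be packed tightly and the codebook fitted with Lloyd‑Max (1‑D k‑means) offline; the sketch only shows the rotate‑then‑snap structure.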
Honest Benchmarks
Comparison: qengine vs. llama.cpp (build 8462, -fa 1, same Q8_0 GGUFs, same hardware). Higher numbers = better.
Prefill
| Model / Prefill Length | Tokens / s (qengine) | Tokens / s (llama.cpp) | Speed‑up |
|---|---|---|---|
| Qwen 3.5‑9B – 297 tokens | 594 | 199 | 2.99× |
| Qwen 3.5‑9B – 1.16K tokens | 683 | 316 | 2.16× |
| Qwen 3.5‑9B – 4.62K tokens | 584 | 361 | 1.62× |
| Qwen 3.5‑9B – 18K tokens | 393 | 324 | 1.22× |
qengine leads at all four lengths, though the advantage narrows from ~3× at short prefills to 1.22× at 18 K.
Generation Throughput
| Model | Tokens / s (qengine) | Tokens / s (llama.cpp) | Gain |
|---|---|---|---|
| 9 B | ≈ 70 | 46.6 | +48 % |
| 27 B | 26.3 | 17.7 | +51 % |
Weak point: 9 B dual‑GPU at 18 K still trails llama.cpp (~0.48×). The reason is that llama.cpp overlaps activation transfer with compute, while qengine must transfer sequentially through pinned host memory (no P2P). Single‑GPU 9 B is already faster than either dual‑GPU run, so the gap is mostly theoretical.
Implementation Challenges
1. Multi‑GPU without P2P
- CMP cards lack peer‑to‑peer and NVLink.
- Hidden state must bounce through pinned host memory between GPUs.
- Solution: a pinned‑host buffer per cross‑GPU edge + a worker thread per GPU. Works, but is sequential.
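A minimal sketch of that hand‑off, assuming the pinned buffer was allocated once with cudaMallocHost (function and stream names are illustrative, not qengine’s actual code):

```cuda
#include <cuda_runtime.h>

// No-P2P hand-off: the hidden state bounces through a pinned host buffer.
// Each stream is assumed to have been created on its own device.
void hop_hidden_state(const float* d_src, int src_dev, cudaStream_t src_stream,
                      float*       d_dst, int dst_dev, cudaStream_t dst_stream,
                      float* h_pinned,     // allocated once with cudaMallocHost
                      size_t bytes)
{
    // GPU A -> pinned host memory
    cudaSetDevice(src_dev);
    cudaMemcpyAsync(h_pinned, d_src, bytes, cudaMemcpyDeviceToHost, src_stream);
    cudaStreamSynchronize(src_stream);   // no P2P: must land in host RAM first

    // pinned host memory -> GPU B
    cudaSetDevice(dst_dev);
    cudaMemcpyAsync(d_dst, h_pinned, bytes, cudaMemcpyHostToDevice, dst_stream);
    cudaStreamSynchronize(dst_stream);   // sequential by construction
}
```

With P2P this would collapse to a single cudaMemcpyPeerAsync; without it, the two synchronizations are exactly what makes the transfer sequential.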
2. Numerical Drift Killing Korean Output
- Qwen 3.5‑9B’s Korean circuits are weak; small fp16 reorder noise can flip argmax decisions, producing garbled Korean.
- After a chunked‑prefill kernel optimisation that passed English tests, Korean broke.
- Fix: every kernel that touches the attention‑reduction order now runs a Korean argmax‑stability check before shipping.
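A sketch of what such a check can look like: greedy‑decode a fixed Korean prompt with a reference build and the candidate kernel build, and require the argmax token streams to match. The forward‑pass hook here is a hypothetical stand‑in, not qengine’s real test harness.

```cuda
#include <algorithm>
#include <vector>

// Hypothetical stand-in for "run the model and return next-token logits".
using ForwardFn = std::vector<float> (*)(const std::vector<int>& tokens);

// Greedy-decode `steps` tokens from a fixed (Korean) prompt with both builds
// and fail as soon as a single argmax decision flips.
bool argmax_stable(ForwardFn reference, ForwardFn candidate,
                   std::vector<int> tokens, int steps)
{
    for (int s = 0; s < steps; ++s) {
        std::vector<float> ref = reference(tokens);
        std::vector<float> cnd = candidate(tokens);
        int ref_tok = (int)(std::max_element(ref.begin(), ref.end()) - ref.begin());
        int cnd_tok = (int)(std::max_element(cnd.begin(), cnd.end()) - cnd.begin());
        if (ref_tok != cnd_tok) return false;   // one flipped argmax is enough to garble Korean
        tokens.push_back(ref_tok);              // continue the decode along the reference path
    }
    return true;
}
```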
3. Split‑K FlashAttention without Breaking Determinism
- Original 64‑block FA grid under‑utilised SMs at long context (only 64 blocks across 3 × 68 = 204 SMs).
- Added a split‑K variant mapping each (kv_head, t_idx) to N independent blocks, each handling a contiguous tile range.
- Merged partials using the log‑sum‑exp identity:
m_global = max_s m_s
l_global = Σ_s exp(m_s − m_global) · l_s
o_global = Σ_s exp(m_s − m_global) · acc_o_s
- First version stored partial o accumulators as half → drift after ~31 generated tokens (not bit‑exact).
- Storing partials as fp32 reduces drift to fp32‑reordering noise (~1e‑7 per add) → greedy argmax stable across 32+ tokens.
- Result: 18 K prefill went from 270 → 393 t/s (9 B) and 104 → 139 t/s (27 B).
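Spelled out in code, the merge step above looks roughly like this: a host‑side sketch of the math with fp32 partials (not the actual kernel), where S splits each produced m_s, l_s, and an unnormalized acc_o_s:

```cuda
#include <cmath>

// Merge S split-K partials using the log-sum-exp identity quoted above.
// Partials stay in fp32 until this point (the determinism fix described above).
void merge_splits(const float* m_s, const float* l_s, const float* acc_o_s, // [S], [S], [S, head_dim]
                  int S, int head_dim, float* o_out)                        // [head_dim]
{
    // m_global = max_s m_s
    float m_global = -INFINITY;
    for (int s = 0; s < S; ++s) m_global = fmaxf(m_global, m_s[s]);

    // l_global = sum_s exp(m_s - m_global) * l_s
    // o_global = sum_s exp(m_s - m_global) * acc_o_s
    float l_global = 0.0f;
    for (int d = 0; d < head_dim; ++d) o_out[d] = 0.0f;
    for (int s = 0; s < S; ++s) {
        float w = expf(m_s[s] - m_global);
        l_global += w * l_s[s];
        for (int d = 0; d < head_dim; ++d)
            o_out[d] += w * acc_o_s[s * head_dim + d];
    }

    // Final normalization by the merged softmax denominator.
    float inv_l = 1.0f / l_global;
    for (int d = 0; d < head_dim; ++d) o_out[d] *= inv_l;
}
```

The final division by l_global is the usual softmax normalization; what mattered for determinism was keeping m_s, l_s, and acc_o_s in fp32 until this point.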
4. Speculative Decoding (still broken)
- Repo contains DFlash + DDTree code for a future fine‑tuned drafter.
- Pre‑trained drafter (lucebox‑hub/dflash) was trained on stock Qwen 3.5; our distill’s output distribution mismatches → accept rate ≈ 0 % and chains degenerate.
- Marked as broken on purpose in the README.
- MTP K=1 single‑token spec works fine.
Who Should Use qengine?
| Situation | Recommendation |
|---|---|
| RTX 30/40‑series, A100, H100 | Use vLLM or SGLang – they are far more optimized and have extensive test coverage. |
| Ex‑mining cards (CMP 100‑210, ex‑mining V100, P104‑100, etc.) | qengine may be useful. |
| Older Volta workstations (V100 16/32 GB, Titan V, Quadro GV100) | qengine works (targets sm_70). |
| T4 or RTX 20‑series where standard stacks disappoint | qengine could help. |
| GPUs without DP4A (e.g., sm_60) | Not supported. |
| AMD / Apple GPUs | Not supported. |
qengine targets sm_70 specifically. sm_75 should work but isn’t tuned; sm_60 lacks DP4A and therefore cannot run it.
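If you are unsure what a given card supports, a quick device query tells you whether the DP4A path exists at all (__dp4a needs compute capability 6.1 or newer; qengine’s kernels are tuned for 7.0):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// List every visible GPU and whether it has the DP4A instruction (sm_61+).
int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, i);
        bool has_dp4a = (p.major > 6) || (p.major == 6 && p.minor >= 1);
        printf("GPU %d: %s (sm_%d%d) DP4A: %s\n",
               i, p.name, p.major, p.minor, has_dp4a ? "yes" : "no");
    }
    return 0;
}
```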
Repo
https://github.com/Haru-neo/qengine — Apache 2.0
The benchmarks in this post are reproducible with the bench_curl.sh script in the repo. The 27 B 3‑GPU numbers were measured 2026‑05‑03 on my machine. If you have the hardware and try it, I’d love to know what you see.
Project Details
- Solo project.
- Heavy AI assistance on the CUDA code – I drove the architecture, profiling, and debugging across many sessions; Claude did most of the kernel implementation.
- I’m a Korean high‑school student.
- Slow PR turnaround.