Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Published: February 21, 2026 at 03:57 PM EST


Source: Hacker News

Question

Hi everyone, I’m kinda involved in retrogaming, and while running some experiments I ran into the following question: “Would it be possible to run transformer models while bypassing the CPU and RAM, by connecting the GPU directly to the NVMe drive?”

Solution Overview

This project is the result of that question and some weekend vibecoding (the library repository is also linked in the README). It seems to work, even on consumer GPUs, and should work even better on professional ones.
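For a rough sense of why the weights have to come from storage at all, here is a back-of-envelope sketch. It is not taken from the project: the 7 GB/s NVMe read rate, the bit widths, and the idea that whatever fits in VRAM stays resident while the rest is re-read for every token are all assumptions for illustration, and the helper names are hypothetical. It also ignores the KV cache and activations, so real numbers would be worse.

```python
# Back-of-envelope sizing only; every constant below is an approximation
# or an assumption, not a measurement from the project.
PARAMS = 70e9      # nominal parameter count of a "70B" model
VRAM_GB = 24.0     # RTX 3090 memory capacity
NVME_GBPS = 7.0    # ASSUMED sequential read rate of the NVMe drive


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def stream_seconds_per_token(size_gb: float, vram_gb: float, read_gbps: float) -> float:
    """Time to read the weights that do not fit in VRAM once.

    Assumes a dense model (every weight is used for every token) and that
    the part that fits in VRAM stays resident between tokens.
    """
    overflow_gb = max(0.0, size_gb - vram_gb)
    return overflow_gb / read_gbps


if __name__ == "__main__":
    for bits in (16, 8, 4):
        size = weights_gb(PARAMS, bits)
        secs = stream_seconds_per_token(size, VRAM_GB, NVME_GBPS)
        print(f"{bits:>2}-bit: {size:6.1f} GB of weights, "
              f"~{secs:4.1f} s of NVMe reads per token")
```

Under these assumptions even a 4-bit copy of the weights is larger than the card's memory, so storage read bandwidth, not compute, would set the floor on time per token.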
