An Affordable AI Server
Source: Dev.to
Introduction
Two AMD MI60s from eBay cost me about $1,000 total and gave me 64 GB of VRAM, which is enough to run Llama 3.3 70B at home with a 32K context window.
When I started looking into running large language models locally, the obvious limiting factor was VRAM. Consumer GPUs top out at 24 GB (e.g., an RTX 4090). I wanted to run 70B‑parameter models locally on hardware I own.
Why the MI60?
The MI60 is a 2018 server GPU that AMD built for datacenters. It has 32 GB of HBM2 memory—the same high‑bandwidth memory you find in modern AI accelerators—and you can pick one up for around $500 on eBay. Two of them give you 64 GB of VRAM, more than enough for Llama 3.3 70B.
Pros
- Memory: 32 GB HBM2 per card, higher theoretical bandwidth than GDDR6X.
- Cost: Roughly $500 per card on the secondary market, cheaper than high‑end consumer GPUs with comparable memory.
- Inference performance: For memory‑bound inference workloads, the extra memory and bandwidth matter more than raw compute throughput.
Cons
- Cooling: These are passively cooled cards designed for server chassis with serious airflow. In a regular PC case they thermal‑throttle within minutes.
- PCIe bottleneck: With two cards doing tensor parallelism, PCIe can become the limiting factor.
- Software support: AMD stopped actively developing for the gfx906 architecture, though backward compatibility remains.
Cooling Solution
I 3D‑printed a duct and set up a push‑pull configuration:
- Intake: 120 mm fan inside the case blowing air across the heatsinks.
- Exhaust: 92 mm fan on the rear pulling hot air out.
A custom fan‑controller script keeps the fans in sync with GPU utilization, maintaining junction temperatures around 80 °C instead of the 97 °C I saw before fixing the cooling.
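The exact controller script depends on your fan hardware and how you poll `rocm-smi`, but the heart of it is just a temperature‑to‑duty‑cycle curve. A minimal sketch of that curve (the thresholds here are illustrative, not my exact values):

```python
def fan_duty(temp_c: float) -> int:
    """Map GPU junction temperature (deg C) to a fan duty cycle (0-100%).

    Below `low` the fans idle; above `high` they run flat out;
    in between the duty cycle ramps linearly. Thresholds are
    illustrative placeholders, not tuned values.
    """
    low, high = 50.0, 85.0   # temperature setpoints in deg C
    idle, full = 30, 100     # duty-cycle percentages at those setpoints
    if temp_c <= low:
        return idle
    if temp_c >= high:
        return full
    # linear interpolation between the two setpoints
    frac = (temp_c - low) / (high - low)
    return round(idle + frac * (full - idle))
```

A loop around this that polls the junction temperature every few seconds and writes the duty cycle out to the fan headers is all the "custom script" really needs to be.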
Software Stack
- ROCm: Running ROCm 6.3 without issues; years of bug fixes have made the platform stable.
- Inference framework:
vLLM gave the best experience. I tried Ollama first, but performance was noticeably worse and tensor parallelism across both GPUs wasn’t as smooth. vLLM provides better speeds, though switching models isn’t as simple as Ollama’s pull‑and‑run workflow (I built a custom solution for that).
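For reference, a two‑GPU vLLM setup like this boils down to a few engine arguments; a sketch using vLLM's Python API (the model ID here is a placeholder for whatever AWQ checkpoint you use, and this obviously only runs on a machine with both GPUs visible):

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; substitute your AWQ-quantized checkpoint.
llm = LLM(
    model="some-org/Llama-3.3-70B-Instruct-AWQ",
    tensor_parallel_size=2,   # split layers across both MI60s
    quantization="awq",       # 4-bit AWQ weights
    max_model_len=32768,      # the 32K context window
)

outputs = llm.generate(
    ["Why is HBM2 good for LLM inference?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```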
Performance Numbers
Running vLLM with AWQ‑quantized models on the dual‑MI60 setup:
| Model | Tokens / sec | GPUs (tensor parallel) |
|---|---|---|
| Qwen3 8B | ~90 | 1 |
| Qwen3 32B | ~31 | 1 |
| Llama 3.3 70B | ~26 | 2 (tensor parallel) |
The 8 B and 32 B models respond quickly, and even the 70 B model is very usable.
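As a sanity check on why 64 GB is enough for the 70B model at 32K context: a back‑of‑the‑envelope estimate of 4‑bit AWQ weights plus an fp16 KV cache, using the published Llama 3 70B architecture (80 layers, 8 KV heads via GQA, head dimension 128). The 0.5 bytes/parameter figure is a simplification; AWQ's group scales add a little on top.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """fp16 KV cache size in GiB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama 3.3 70B at 32K context
kv = kv_cache_gb(80, 8, 128, 32_768)   # KV cache
weights = 70e9 * 0.5 / 1024**3         # ~0.5 byte/param at 4-bit
print(f"weights ~ {weights:.0f} GiB, KV cache ~ {kv:.0f} GiB, "
      f"total ~ {weights + kv:.0f} GiB")
```

That lands around 43 GiB before runtime overhead, which is why the model fits in 64 GB with room for a 32K window, and why it would not fit in a 48 GB dual‑consumer‑GPU setup.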
Cost Comparison
Most dual‑GPU consumer setups max out at 48 GB of VRAM. Two MI60s give you 64 GB for around $1,000. You’ll need to solve the cooling (see above), but it’s a one‑time fix.
Future Work
I’ll be writing more about this setup:
- Detailed cooling solution
- Full software stack walkthrough
- Model‑switching workflow
Spoiler: Stable Diffusion still locks up the GPU, and I haven’t gotten Whisper working yet.
Alternative GPUs
The MI60 isn’t the only option. Other cards floating around the secondary market include:
- AMD MI50, MI100
- Various NVIDIA Tesla models
When choosing, consider memory capacity, compute performance, and software support.