An Affordable AI Server
Source: Dev.to
Introduction
Two AMD MI60s from eBay cost me about $1,000 total and gave me 64 GB of VRAM, which is enough to run Llama 3.3 70B at home with a 32K context window.
When I started looking into running large language models locally, the obvious limiting factor was VRAM. Consumer GPUs top out at 24 GB (e.g., an RTX 4090). I wanted to run 70B‑parameter models locally on hardware I own.
Why the MI60?
The MI60 is a 2018 server GPU that AMD built for datacenters. It has 32 GB of HBM2 memory—the same high‑bandwidth memory you find in modern AI accelerators—and you can pick one up for around $500 on eBay. Two of them give you 64 GB of VRAM, more than enough for Llama 3.3 70B.
Pros
- Memory: 32 GB HBM2 per card, higher theoretical bandwidth than GDDR6X.
- Cost: Roughly $500 per card on the secondary market, cheaper than high‑end consumer GPUs with comparable memory.
- Inference performance: For memory‑bound inference workloads, the extra memory and bandwidth matter more than raw compute throughput.
Cons
- Cooling: These are passively cooled cards designed for server chassis with serious airflow. In a regular PC case they thermal‑throttle within minutes.
- PCIe bottleneck: With two cards doing tensor parallelism, PCIe can become the limiting factor.
- Software support: AMD stopped actively developing for the gfx906 architecture, though backward compatibility remains.
Cooling Solution
I 3D‑printed a duct and set up a push‑pull configuration:
- Intake: 120 mm fan inside the case blowing air across the heatsinks.
- Exhaust: 92 mm fan on the rear pulling hot air out.
A custom fan‑controller script keeps the fans in sync with GPU utilization, maintaining junction temperatures around 80 °C instead of the 97 °C I saw before fixing the cooling.
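The exact controller script depends on your fan hardware and how you poll `rocm-smi`, but the heart of it is just a temperature‑to‑duty‑cycle curve. A minimal sketch of that curve (the thresholds here are illustrative, not my exact values):

```python
def fan_duty(temp_c: float) -> int:
    """Map GPU junction temperature (deg C) to a fan duty cycle (0-100%).

    Below `low` the fans idle; above `high` they run flat out;
    in between the duty cycle ramps linearly. Thresholds are
    illustrative placeholders, not tuned values.
    """
    low, high = 50.0, 85.0   # temperature setpoints in deg C
    idle, full = 30, 100     # duty-cycle percentages at those setpoints
    if temp_c <= low:
        return idle
    if temp_c >= high:
        return full
    # linear interpolation between the two setpoints
    frac = (temp_c - low) / (high - low)
    return round(idle + frac * (full - idle))
```

A loop around this that polls the junction temperature every few seconds and writes the duty cycle out to the fan headers is all the "custom script" really needs to be.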
Software Stack
- ROCm: Running ROCm 6.3 without issues; years of bug fixes have made the platform stable.
- Inference framework:
vLLM gave the best experience. I tried Ollama first, but performance was noticeably worse and tensor parallelism across both GPUs wasn’t as smooth. vLLM provides better speeds, though switching models isn’t as simple as Ollama’s pull‑and‑run workflow (I built a custom solution for that).
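For reference, a two‑GPU vLLM setup like this boils down to a few engine arguments; a sketch using vLLM's Python API (the model ID here is a placeholder for whatever AWQ checkpoint you use, and this obviously only runs on a machine with both GPUs visible):

```python
from vllm import LLM, SamplingParams

# Placeholder model ID; substitute your AWQ-quantized checkpoint.
llm = LLM(
    model="some-org/Llama-3.3-70B-Instruct-AWQ",
    tensor_parallel_size=2,   # split layers across both MI60s
    quantization="awq",       # 4-bit AWQ weights
    max_model_len=32768,      # the 32K context window
)

outputs = llm.generate(
    ["Why is HBM2 good for LLM inference?"],
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```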
Performance Numbers
Running vLLM with AWQ‑quantized models on the dual‑MI60 setup:
| Model | Tokens / sec | GPUs (tensor parallel) |
|---|---|---|
| Qwen3 8B | ~90 | 1 |
| Qwen3 32B | ~31 | 1 |
| Llama 3.3 70B | ~26 | 2 (tensor parallel) |
The 8 B and 32 B models respond quickly, and even the 70 B model is very usable.
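As a sanity check on why 64 GB is enough for the 70B model at 32K context: a back‑of‑the‑envelope estimate of 4‑bit AWQ weights plus an fp16 KV cache, using the published Llama 3 70B architecture (80 layers, 8 KV heads via GQA, head dimension 128). The 0.5 bytes/parameter figure is a simplification; AWQ's group scales add a little on top.

```python
def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context: int, bytes_per_elem: int = 2) -> float:
    """fp16 KV cache size in GiB: 2 tensors (K and V) per layer per token."""
    return 2 * layers * kv_heads * head_dim * context * bytes_per_elem / 1024**3

# Llama 3.3 70B at 32K context
kv = kv_cache_gb(80, 8, 128, 32_768)   # KV cache
weights = 70e9 * 0.5 / 1024**3         # ~0.5 byte/param at 4-bit
print(f"weights ~ {weights:.0f} GiB, KV cache ~ {kv:.0f} GiB, "
      f"total ~ {weights + kv:.0f} GiB")
```

That lands around 43 GiB before runtime overhead, which is why the model fits in 64 GB with room for a 32K window, and why it would not fit in a 48 GB dual‑consumer‑GPU setup.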
Cost Comparison
Most dual‑GPU consumer setups max out at 48 GB of VRAM. Two MI60s give you 64 GB for around $1,000. You’ll need to solve the cooling (see above), but it’s a one‑time fix.
Future Work
I’ll be writing more about this setup:
- Detailed cooling solution
- Full software stack walkthrough
- Model‑switching workflow
Spoiler: Stable Diffusion still locks up the GPU, and I haven’t gotten Whisper working yet.
Alternative GPUs
The MI60 isn’t the only option. Other cards floating around the secondary market include:
- AMD MI50, MI100
- Various NVIDIA Tesla models
When choosing, consider memory capacity, compute performance, and software support.