NVIDIA GeForce RTX 5090 32GB
Why this one: The fastest local inference money buys short of datacenter hardware: 1,792 GB/s of memory bandwidth — 1.9× a 3090 — runs 30B-class models (Qwen, Llama, Mistral) at a fluid 40–55 tokens/sec, fully in VRAM at Q4. If your models fit in 32GB, nothing consumer comes close.
What it beat: Workstation cards (RTX 6000-class: 2–3× the price for certified drivers and density, not faster tokens) and the RTX 4090 (when found new, similar street money for 56% of the bandwidth, 1,008 vs. 1,792 GB/s, and only 24GB).
Tighter budget? The used-3090 pick below, or AMD's RX 7900 XTX 24GB (~$900 new): llama.cpp runs it well via Vulkan/ROCm at 960 GB/s — the best non-NVIDIA value if you'll tolerate setup friction and occasional tooling gaps.
Blackwell silicon is mature; the known risk is the 12V-2x6 power connector feeding a 575W card. Use the native ATX 3.1 cable, never adapters, and seat it fully. Board-partner quality varies; favor brands with strong RMA reputations.
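The bandwidth comparisons above are simple ratios and can be checked in a few lines of Python. This is a rough sketch, not a benchmark. The 5090 and 7900 XTX figures come from the text; the 3090 (936 GB/s) and 4090 (1,008 GB/s) figures are spec-sheet values. `decode_ceiling_tps` is a hypothetical helper built on the common rule of thumb that token generation is memory-bound, so each generated token reads roughly the whole model once. The ~18 GB size for a 32B model at Q4 is an assumption.

```python
# Peak memory bandwidth in GB/s. The 5090 and 7900 XTX values are quoted in
# the text above; the 3090 and 4090 values are public spec-sheet figures.
BANDWIDTH_GBPS = {
    "RTX 5090": 1792,
    "RTX 4090": 1008,
    "RTX 3090": 936,
    "RX 7900 XTX": 960,
}


def bandwidth_ratio(card: str, baseline: str) -> float:
    """Return how many times the baseline's bandwidth the card offers."""
    return BANDWIDTH_GBPS[card] / BANDWIDTH_GBPS[baseline]


def decode_ceiling_tps(card: str, model_gb: float) -> float:
    """Rule-of-thumb upper bound on tokens/sec (an assumption, not a
    measurement): every generated token streams all weights from VRAM once."""
    return BANDWIDTH_GBPS[card] / model_gb


if __name__ == "__main__":
    print(f"5090 vs 3090: {bandwidth_ratio('RTX 5090', 'RTX 3090'):.2f}x")
    print(f"4090 vs 5090: {bandwidth_ratio('RTX 4090', 'RTX 5090'):.0%}")
    # Assumed ~18 GB of weights for a 32B model at Q4.
    print(f"5090 ceiling: {decode_ceiling_tps('RTX 5090', 18.0):.0f} tok/s")
```

Under these assumptions the 5090's theoretical ceiling for a 32B Q4 model is roughly 100 tokens/sec, so the 40–55 tokens/sec quoted above is about half of that ceiling.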
Common concerns
- What fits in 32GB? — Models up to ~32B at Q4 fit comfortably with full context. 70B does NOT fit: partial CPU offload drops speed to single-digit tokens/sec; that's the capacity pick's job.
- PSU and power? — 1000W+ ATX 3.1 for one card (575W + spikes). Our PC build's 750W does not cut it — see the add-on.
- Adding a second GPU later? — Consumer boards split to PCIe 5.0 x8/x8: fine for inference (tensor-parallel traffic is modest; it's training that hates thin lanes). Mind case airflow and a 1500W ceiling.
- Intel/AMD instead? — The 7900 XTX is the real alternative (above). Intel Arc is budget-tier, suited only to small models; the software (IPEX-LLM) works but trails CUDA tooling.
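The "What fits in 32GB?" answer above comes down to arithmetic: parameter count times bits per weight, plus some headroom. The sketch below is only an estimate under stated assumptions. The 4.5 bits per weight for Q4 and the 4 GB allowance for KV cache and runtime buffers are assumed values, not measurements, and both function names are hypothetical.

```python
def weights_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate size of the quantized weights in GB.

    4.5 bits/weight is an assumed effective rate for Q4-style quantization;
    real formats vary slightly.
    """
    return params_billion * bits_per_weight / 8


def fits_in_vram(
    params_billion: float,
    vram_gb: float = 32.0,
    bits_per_weight: float = 4.5,
    overhead_gb: float = 4.0,  # assumed KV cache + runtime buffers
) -> bool:
    """True if weights plus the assumed overhead fit entirely in VRAM."""
    return weights_gb(params_billion, bits_per_weight) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (14, 32, 70):
        print(
            f"{size}B at Q4: ~{weights_gb(size):.1f} GB weights, "
            f"fits in 32GB: {fits_in_vram(size)}"
        )
```

With these assumptions a 32B model needs about 18 GB for weights and fits with room to spare, while a 70B model needs about 39 GB for weights alone and does not fit.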
