Reference 6 min read · Updated April 2026

Which Models Can Your Hardware Actually Run? A RAM Guide

The practical RAM requirements for every popular model size — 3B through 70B — by quantization. Which VPS tier or hardware do you actually need?

The most common question in self-hosted AI: "can my hardware run this model?" The answer depends on model size, quantization, and context length. Here's the definitive reference table.

Model Size × Quantization → RAM Required

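| Model Size | FP16 | Q8 | Q4_K_M | Q3_K |
|---|---|---|---|---|
| 3B | 6.5 GB | 4.5 GB | 2.5 GB | 1.8 GB |
| 7B | 14 GB | 9 GB | 6–7 GB | 4.5 GB |
| 13B | 26 GB | 16 GB | 10–12 GB | 8 GB |
| 30B | 60 GB | 38 GB | 22–24 GB | 17 GB |
| 70B | 140 GB | 88 GB | 48–50 GB | 38 GB |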

Rule of thumb: Use Q4_K_M. It's Ollama's default for good reason: roughly 95% of the quality at about half the size of FP16. Only go lower (Q3) if you're genuinely RAM-constrained.
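As a rough illustration, the table can be turned into a lookup that answers "what is the best quantization my machine can hold?". This is only a sketch: the figures are copied from the table (ranges use their upper bound), while the function name `best_quant` and the 1 GB default headroom are assumptions made for this example, not part of any tool.

```python
# RAM figures in GB, copied from the table above; ranges use their upper bound.
RAM_GB = {
    "3B":  {"FP16": 6.5, "Q8": 4.5, "Q4_K_M": 2.5, "Q3_K": 1.8},
    "7B":  {"FP16": 14,  "Q8": 9,   "Q4_K_M": 7,   "Q3_K": 4.5},
    "13B": {"FP16": 26,  "Q8": 16,  "Q4_K_M": 12,  "Q3_K": 8},
    "30B": {"FP16": 60,  "Q8": 38,  "Q4_K_M": 24,  "Q3_K": 17},
    "70B": {"FP16": 140, "Q8": 88,  "Q4_K_M": 50,  "Q3_K": 38},
}

# Ordered from highest to lowest precision.
QUANT_ORDER = ("FP16", "Q8", "Q4_K_M", "Q3_K")


def best_quant(model_size, ram_gb, headroom_gb=1.0):
    """Return the highest-precision quantization that fits, or None.

    headroom_gb is an assumed safety margin for the OS and other processes.
    """
    for quant in QUANT_ORDER:
        if RAM_GB[model_size][quant] + headroom_gb <= ram_gb:
            return quant
    return None


if __name__ == "__main__":
    for size, ram in (("3B", 4), ("7B", 8), ("13B", 8), ("13B", 16)):
        print(f"{size} on {ram} GB -> {best_quant(size, ram)}")
```

With these numbers a 7B model on an 8 GB machine lands on Q4_K_M, and a 13B model does not fit on 8 GB at any listed quantization once headroom is included.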


Hardware Tiers: What Each Level Unlocks

Tier 1 — €5.89/mo Hetzner CX22 · 4 GB RAM

Runs: Phi-3 Mini Q4, Qwen 2.5 3B Q4 (2–3 tok/s)

Can't run: Any 7B model — will swap to disk and become unusably slow

Best for: Learning Ollama, testing configs, hobby side projects

Tier 2 — €6.80/mo ← Sweet spot Hetzner CX32 · 8 GB RAM

Runs: Qwen 2.5 7B Q4, Mistral 7B Q4, Llama 3.1 8B Q4 (8–12 tok/s)

Can't run: 13B Q4 (needs 10–12 GB, more than the 8 GB this tier has)

Best for: Production code completion, private ChatGPT, RAG pipelines

Tier 3 — €16.40/mo Hetzner CX42 · 16 GB RAM

Runs: Qwen 2.5 14B Q4, Mistral Nemo 12B Q4 (5–8 tok/s)

Warning: At an 8,192-token context, the KV cache of a 13B-class model pushes memory use close to the 16 GB limit. Test at your target context length before going to production.

Best for: Multi-user inference, business use, higher quality responses

Tier 4 — $599 one-time Mac Mini M4 · 16 GB unified

Runs: Qwen 2.5 7B at 35–42 tok/s via MLX (3–4× faster than CX32)

Can't run: 70B models (needs 48 GB+)

Best for: Always-on home server, privacy-first, highest inference speed under $1k

Tier 5 — ~$0.44/hr RunPod RTX 4090 · 24 GB VRAM

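Runs: 7B–13B models up to Q8, and 30B at Q4_K_M (22–24 GB, which leaves almost no room for context on a 24 GB card)

Can't run: 70B on a single card. Per the table above, 70B needs 48–50 GB even at Q4_K_M and 140 GB at FP16.

Best for: Fast GPU inference up to 30B, bursty workloads, pay-per-use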
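To make the tok/s figures in the tiers above more tangible, here is a small sketch that converts them into the time needed to generate one answer. The speeds are the ranges quoted above; the 300-token answer length and the name `generation_seconds` are assumptions chosen for illustration, and prompt processing time is ignored.

```python
def generation_seconds(n_tokens, tokens_per_second):
    """Time to generate n_tokens at a steady decoding speed."""
    return n_tokens / tokens_per_second


# (slow, fast) tok/s ranges quoted in the tiers above.
TIER_SPEEDS = {
    "CX22, 3B Q4": (2, 3),
    "CX32, 7B Q4": (8, 12),
    "CX42, 13B-class Q4": (5, 8),
    "Mac Mini M4, 7B via MLX": (35, 42),
}

ANSWER_TOKENS = 300  # assumed length of a typical answer

if __name__ == "__main__":
    for tier, (slow, fast) in TIER_SPEEDS.items():
        best = generation_seconds(ANSWER_TOKENS, fast)
        worst = generation_seconds(ANSWER_TOKENS, slow)
        print(f"{tier}: {best:.0f}-{worst:.0f} s per {ANSWER_TOKENS}-token answer")
```

At 2–3 tok/s a 300-token answer takes minutes rather than seconds, which is why the 4 GB tier is best kept for learning and testing.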

Context Length Matters

KV cache grows with context. A 7B model on an 8 GB server:

| Context Tokens | RAM Used | Headroom (8 GB) |
|---|---|---|
| 512 | 6.1 GB | 1.9 GB ✓ |
| 2,048 | 6.7 GB | 1.3 GB ✓ |
| 4,096 | 7.3 GB | 0.7 GB ⚠️ |
| 8,192 | 8.0 GB | 0 GB ✗ |

Rule: If you need 4k+ context, go up one tier.
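The growth comes from storing one key and one value vector per layer for every token in the context. A minimal sketch of the standard size estimate follows. The architecture numbers (32 layers, 8 KV heads, head dimension 128, 2-byte cache values) are assumptions for a typical 7B-class model with grouped-query attention, so the output will not match the measured table above exactly; that table also includes weights and runtime overhead.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


if __name__ == "__main__":
    # Assumed, illustrative architecture; check your model's config for real values.
    for n_tokens in (512, 2048, 4096, 8192):
        gib = kv_cache_bytes(32, 8, 128, n_tokens) / 2**30
        print(f"{n_tokens:>5} tokens -> {gib:.2f} GiB of KV cache")
```

The estimate is linear in context length, so doubling the context doubles the cache. Models without grouped-query attention have more KV heads and a proportionally larger cache.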

Quantization Decision Tree

Do you have a GPU (8 GB VRAM+)?
├─ YES → use Q5 or FP16 if the model fits in VRAM (quality > size)
│         GPU runs these roughly 10× faster than CPU
│
└─ NO  → use Q4_K_M (the default CPU-practical choice)
          RAM headroom <2 GB? → try Q3_K
          RAM headroom ≥2 GB? → stick with Q4_K_M
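The same tree can be written as a small function, which is handy in a provisioning script. This is a sketch of the decision tree only: the function name `choose_quant` and its parameters are made up for this example, and treating exactly 2 GB of headroom as enough for Q4_K_M is an assumption, since the tree does not say.

```python
def choose_quant(gpu_vram_gb, ram_headroom_gb):
    """Follow the decision tree above and return a quantization suggestion.

    gpu_vram_gb: VRAM of the available GPU in GB, or 0 if there is no GPU.
    ram_headroom_gb: free RAM in GB left after loading a Q4_K_M model.
    """
    if gpu_vram_gb >= 8:
        return "Q5 or FP16"
    if ram_headroom_gb < 2:
        return "Q3_K"
    # Assumption: exactly 2 GB of headroom counts as enough for Q4_K_M.
    return "Q4_K_M"


if __name__ == "__main__":
    print(choose_quant(gpu_vram_gb=24, ram_headroom_gb=0))   # GPU branch
    print(choose_quant(gpu_vram_gb=0, ram_headroom_gb=1.0))  # CPU, tight RAM
    print(choose_quant(gpu_vram_gb=0, ram_headroom_gb=3.0))  # CPU, enough RAM
```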

Check RAM Before Committing

# On a Linux VPS, after loading a model:
free -h

# macOS has no free command; check memory there with:
# top -l 1 | grep PhysMem

# Good (CX32 running 7B):
# Mem: 7.0Gi used, 1.0Gi free  ✓

# Too tight:
# Mem: 7.8Gi used, 0.2Gi free  ⚠️ upgrade or use Q3
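If you want the same check inside a script, one option on Linux is to read `/proc/meminfo` directly. The sketch below parses that file's `Field: value kB` format and applies a threshold. The 1 GB default mirrors the "good" example above, but the threshold and the helper names (`parse_meminfo`, `available_gb`, `verdict`) are assumptions for illustration only.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gb(info):
    """Memory the kernel estimates is available for new allocations, in GiB."""
    return info["MemAvailable"] / (1024 ** 2)


def verdict(info, min_headroom_gb=1.0):
    """Return 'ok' if at least min_headroom_gb is available, else 'too tight'."""
    return "ok" if available_gb(info) >= min_headroom_gb else "too tight"


# Made-up sample in the /proc/meminfo format, for demonstration only.
SAMPLE = """MemTotal:        8000000 kB
MemFree:          300000 kB
MemAvailable:    1048576 kB
"""

if __name__ == "__main__":
    print(verdict(parse_meminfo(SAMPLE)))
```

On a real Linux server, pass the contents of `/proc/meminfo` instead of `SAMPLE`, for example `verdict(parse_meminfo(open("/proc/meminfo").read()))`.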

Quick Shopping Guide

Just learning

Hetzner CX22 — €5.89/mo

Run 3B models, cancel anytime. Zero commitment.

Copilot replacement

Hetzner CX32 — €6.80/mo

7B Coder model, 8–12 tok/s, always-on.

Home always-on server

Mac Mini M4 — $599

35–42 tok/s via MLX. Breaks even vs VPS in 7 years.

Need 70B now

RunPod RTX 4090 — $0.44/hr

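Pay per use. Note that a single 24 GB RTX 4090 cannot hold a 70B model (48–50 GB at Q4_K_M), so for 70B rent multiple cards or a GPU with more VRAM.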
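The Mac Mini break-even figure quoted above can be checked with simple division. This sketch compares the $599 purchase price against the €6.80/mo CX32 and works out to about 88 months. It deliberately treats dollars and euros as equal and ignores electricity, both simplifying assumptions, so read the result as a ballpark only.

```python
def break_even_months(one_time_cost, monthly_cost):
    """Months of rental after which buying the hardware costs the same."""
    return one_time_cost / monthly_cost


if __name__ == "__main__":
    # Assumption: USD and EUR treated 1:1, electricity not included.
    months = break_even_months(599, 6.80)
    print(f"{months:.0f} months, about {months / 12:.1f} years")
```

The purchase pays off sooner if you would otherwise rent a larger tier, and the speed advantage (35–42 tok/s versus 8–12 tok/s) is not captured by this comparison.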