The most common question in self-hosted AI: "Can my hardware run this model?" The answer depends on model size, quantization, and context length. The tables below give approximate RAM requirements for each combination.
Model Size × Quantization → RAM Required
| Model Size | FP16 | Q8 | Q4_K_M | Q3_K |
|---|---|---|---|---|
| 3B | 6.5 GB | 4.5 GB | 2.5 GB | 1.8 GB |
| 7B | 14 GB | 9 GB | 6–7 GB | 4.5 GB |
| 13B | 26 GB | 16 GB | 10–12 GB | 8 GB |
| 30B | 60 GB | 38 GB | 22–24 GB | 17 GB |
| 70B | 140 GB | 88 GB | 48–50 GB | 38 GB |
Rule of thumb: Use Q4_K_M. It's Ollama's default for good reason: roughly 95% of the quality at half the size of FP16 or less (34–50% in the table above). Only go lower (Q3_K) if you're genuinely RAM-constrained.
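To turn the table into a quick yes/no check, here is a minimal Python sketch. The figures are the upper ends of the Q4_K_M column above. The function name `fits_in_ram` and the 1 GB default headroom are assumptions made for illustration, not part of Ollama or any other tool.

```python
# Upper-bound RAM figures (GB) copied from the Q4_K_M column of the table above.
Q4_K_M_GB = {"3B": 2.5, "7B": 7.0, "13B": 12.0, "30B": 24.0, "70B": 50.0}


def fits_in_ram(model_size: str, total_ram_gb: float, headroom_gb: float = 1.0) -> bool:
    """Return True if the model plus a safety margin fits in total RAM.

    headroom_gb is an assumed margin for the OS and other processes;
    tune it for your own machine.
    """
    return Q4_K_M_GB[model_size] + headroom_gb <= total_ram_gb


if __name__ == "__main__":
    # Example: which Q4_K_M models fit on an 8 GB server?
    for size in Q4_K_M_GB:
        print(f"{size} on 8 GB: {fits_in_ram(size, 8.0)}")
```

On an 8 GB server this reports that 3B and 7B fit and that 13B and larger do not.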
Hardware Tiers: What Each Level Unlocks
Runs: Phi-3 Mini Q4, Qwen 2.5 3B Q4 (2–3 tok/s)
Can't run: Any 7B model — will swap to disk and become unusably slow
Best for: Learning Ollama, testing configs, hobby side projects
Runs: Qwen 2.5 7B Q4, Mistral 7B Q4, Llama 3.1 8B Q4 (8–12 tok/s)
Can't run: 13B Q4 (needs 10–12 GB, too tight once you leave headroom)
Best for: Production code completion, private ChatGPT, RAG pipelines
Runs: Qwen 2.5 14B Q4, Mistral NeMo 12B Q4 (5–8 tok/s)
Warning: 13B at an 8,192-token context pushes the KV cache to the edge of available RAM. Test before going to production.
Best for: Multi-user inference, business use, higher quality responses
Runs: Qwen 2.5 7B at 35–42 tok/s via MLX (3–4× faster than CX32)
Can't run: 70B models (need 48 GB+)
Best for: Always-on home server, privacy-first, highest inference speed under $1k
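Runs: Llama 3.3 70B at 40+ tok/s and any 7B–70B model, as long as total VRAM covers the chosen quantization (per the table above, 70B needs about 140 GB at FP16 and 48–50 GB at Q4_K_M).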
Best for: 70B inference, bursty workloads, pay-per-use
Context Length Matters
KV cache grows with context. A 7B model on an 8 GB server:
| Context Tokens | RAM Used | Headroom (8 GB) |
|---|---|---|
| 512 | 6.1 GB | 1.9 GB ✓ |
| 2,048 | 6.7 GB | 1.3 GB ✓ |
| 4,096 | 7.3 GB | 0.7 GB ⚠️ |
| 8,192 | 8.0 GB | 0 GB ✗ |
Rule: If you need 4k+ context, go up one tier.
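The growth of the KV cache can be estimated from the model's architecture: each token stores one key and one value vector per layer. The sketch below is a rough estimate, not a measurement. The default values (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are assumptions chosen to resemble a 7B model with grouped-query attention. They are not taken from the table above, so the results will not match it exactly.

```python
def kv_cache_bytes(context_tokens: int,
                   n_layers: int = 32,         # assumed layer count
                   n_kv_heads: int = 8,        # assumed KV heads (grouped-query attention)
                   head_dim: int = 128,        # assumed dimension per head
                   bytes_per_element: int = 2  # assumed FP16 cache
                   ) -> int:
    """Estimate KV cache size: 2 tensors (K and V) per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_element
    return per_token * context_tokens


if __name__ == "__main__":
    # Context lengths from the table above.
    for tokens in (512, 2048, 4096, 8192):
        gib = kv_cache_bytes(tokens) / 2**30
        print(f"{tokens:>5} tokens: {gib:.2f} GiB")
```

With these assumed values each token costs 128 KiB, so an 8,192-token context needs 1 GiB for the cache alone. A model without grouped-query attention (more KV heads) would need several times that.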
Quantization Decision Tree
Do you have a GPU (8 GB VRAM+)?
├─ YES → use Q5 or FP16 (quality > size)
│        GPU runs these 10× faster than CPU
│
└─ NO → use Q4_K_M (the only CPU-practical choice)
         RAM headroom <2 GB? → try Q3_K
         RAM headroom >2 GB? → stick with Q4_K_M
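The same tree can be written as a small function, which is handy in a provisioning script. This is a sketch under two assumptions: a machine without a GPU is passed as `gpu_vram_gb=0`, and a headroom of exactly 2 GB (which the tree does not cover) is treated as enough for Q4_K_M. The function name is hypothetical.

```python
def choose_quantization(gpu_vram_gb: float, ram_headroom_gb: float) -> str:
    """Mirror the decision tree above.

    gpu_vram_gb: VRAM of the GPU, or 0 if there is none (assumed convention).
    ram_headroom_gb: RAM left over after loading a Q4_K_M model.
    """
    if gpu_vram_gb >= 8:
        # GPU branch: quality matters more than size.
        return "Q5 or FP16"
    if ram_headroom_gb < 2:
        # CPU branch, RAM-constrained.
        return "Q3_K"
    # CPU branch with enough headroom (exactly 2 GB lands here by assumption).
    return "Q4_K_M"


if __name__ == "__main__":
    print(choose_quantization(gpu_vram_gb=0, ram_headroom_gb=1.0))
    print(choose_quantization(gpu_vram_gb=0, ram_headroom_gb=3.0))
    print(choose_quantization(gpu_vram_gb=12, ram_headroom_gb=3.0))
```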
Check RAM Before Committing
# On a Linux VPS, after loading a model (macOS has no `free`; use Activity Monitor or `vm_stat` there):
free -h
# Good (CX32 running 7B):
# Mem: 7.0Gi used, 1.0Gi free ✓
# Too tight:
# Mem: 7.8Gi used, 0.2Gi free ⚠️ — upgrade or use Q3
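If you would rather script this check, the Linux kernel exposes the figure behind the "available" column of `free` as `MemAvailable` in `/proc/meminfo`. The sketch below is Linux-only. The 1 GiB threshold is an assumption based on the two examples above, and both function names are hypothetical.

```python
def parse_mem_available_gib(meminfo_text: str) -> float:
    """Extract MemAvailable from /proc/meminfo contents and convert kB to GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kib = int(line.split()[1])  # the value is reported in kB
            return kib / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def headroom_verdict(available_gib: float, min_headroom_gib: float = 1.0) -> str:
    """Classify headroom; the 1 GiB default is an assumed threshold."""
    return "ok" if available_gib >= min_headroom_gib else "too tight"


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            available = parse_mem_available_gib(f.read())
    except (OSError, ValueError):
        print("Could not read MemAvailable (is this Linux?)")
    else:
        print(f"{available:.1f} GiB available: {headroom_verdict(available)}")
```

Run it after the model has loaded and served a request at your longest expected context, since that is when the KV cache is at its largest.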
Quick Shopping Guide
Hetzner CX22 — €5.89/mo
Run 3B models, cancel anytime. Zero commitment.
Hetzner CX32 — €6.80/mo
7B Coder model, 8–12 tok/s, always-on.
Mac Mini M4 — $599
35–42 tok/s via MLX. Breaks even vs VPS in 7 years.
RunPod RTX 4090 — $0.44/hr
Pay per use. 40+ tok/s on full 70B models.
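The 7-year break-even figure for the Mac Mini follows from dividing its price by the CX32's monthly cost. The sketch below reproduces that arithmetic under loud assumptions: it treats dollars and euros at parity and ignores electricity, resale value, and price changes. The function name is hypothetical.

```python
def break_even_months(upfront_cost: float, monthly_cost: float) -> float:
    """Months until a one-time purchase costs less than a monthly rental."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return upfront_cost / monthly_cost


if __name__ == "__main__":
    # Prices from the list above; currencies treated at parity (assumption).
    months = break_even_months(upfront_cost=599.0, monthly_cost=6.80)
    print(f"{months:.0f} months, about {months / 12:.1f} years")
```

The same function works for the hourly option if you first convert expected usage into a monthly cost, for example hours per month times $0.44.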