Which Models Can Your Hardware Actually Run? A RAM Guide — clawdVPS

The most common question in self-hosted AI: "can my hardware run this model?" The answer depends on model size, quantization, and context length. Here's the definitive reference table.

Model Size × Quantization → RAM Required

Model Size	FP16	Q8	Q4_K_M	Q3_K
3B	6.5 GB	4.5 GB	2.5 GB	1.8 GB
7B	14 GB	9 GB	6–7 GB	4.5 GB
13B	26 GB	16 GB	10–12 GB	8 GB
30B	60 GB	38 GB	22–24 GB	17 GB
70B	140 GB	88 GB	48–50 GB	38 GB

Rule of thumb: Use Q4_K_M. It's Ollama's default for good reason: about 95% of the quality at 35–50% of the FP16 RAM footprint shown in the table above. Only go lower (Q3) if you're genuinely RAM-constrained.
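To turn the table into a quick check, here is a minimal sketch that lists which model size and quantization pairs fit in a given amount of RAM. The figures are copied from the table above (ranges are stored as their upper bound); the function name `models_that_fit` and the default 1 GB of headroom are illustrative assumptions, not part of any tool.

```python
# RAM requirements in GB, copied from the table above.
# Ranges such as "6–7 GB" are stored as their upper bound to stay conservative.
RAM_GB = {
    "3B":  {"FP16": 6.5, "Q8": 4.5, "Q4_K_M": 2.5, "Q3_K": 1.8},
    "7B":  {"FP16": 14,  "Q8": 9,   "Q4_K_M": 7,   "Q3_K": 4.5},
    "13B": {"FP16": 26,  "Q8": 16,  "Q4_K_M": 12,  "Q3_K": 8},
    "30B": {"FP16": 60,  "Q8": 38,  "Q4_K_M": 24,  "Q3_K": 17},
    "70B": {"FP16": 140, "Q8": 88,  "Q4_K_M": 50,  "Q3_K": 38},
}


def models_that_fit(total_ram_gb, headroom_gb=1.0):
    """Return (size, quant) pairs whose table value fits in total RAM minus headroom.

    headroom_gb is an assumed safety margin for the OS and KV cache.
    """
    budget = total_ram_gb - headroom_gb
    return [
        (size, quant)
        for size, quants in RAM_GB.items()
        for quant, need in quants.items()
        if need <= budget
    ]


for ram in (4, 8, 16):
    print(f"{ram} GB:", models_that_fit(ram))
```

With the default headroom, a 4 GB machine is left with only the 3B Q4_K_M and Q3_K rows, which lines up with the entry-level tier described below.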

Hardware Tiers: What Each Level Unlocks

Tier 1 — €5.89/mo Hetzner CX22 · 4 GB RAM

Runs: Phi-3 Mini Q4, Qwen 2.5 3B Q4 (2–3 tok/s)

Can't run: Any 7B model — will swap to disk and become unusably slow

Best for: Learning Ollama, testing configs, hobby side projects

Tier 2 — €6.80/mo ← Sweet spot Hetzner CX32 · 8 GB RAM

Runs: Qwen 2.5 7B Q4, Mistral 7B Q4, Llama 3.1 8B Q4 (8–12 tok/s)

Can't run: 13B Q4 (needs 10–12 GB, more than the 8 GB available)

Best for: Production code completion, private ChatGPT, RAG pipelines

Tier 3 — €16.40/mo Hetzner CX42 · 16 GB RAM

Runs: Qwen 2.5 14B Q4, Mistral NeMo 12B Q4 (5–8 tok/s)

Warning: At an 8,192-token context, the KV cache of a 13B-class model pushes RAM use to the edge of 16 GB. Test before going to production.

Best for: Multi-user inference, business use, higher quality responses

Tier 4 — $599 one-time Mac Mini M4 · 16 GB unified

Runs: Qwen 2.5 7B at 35–42 tok/s via MLX (3–4× faster than CX32)

Can't run: 70B models (needs 48 GB+)

Best for: Always-on home server, privacy-first, highest inference speed under $1k

Tier 5 — ~$0.44/hr RunPod RTX 4090 · 24 GB VRAM

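Runs: Any model that fits in 24 GB of VRAM: 7B at FP16, 13B at Q8, 30B at Q4_K_M (tight). Llama 3.3 70B does not fit on a single card: per the table above it needs 140 GB at FP16 and 48–50 GB even at Q4_K_M, so it requires multiple GPUs or CPU offload.

Best for: Fast inference on models up to about 30B, bursty workloads, pay-per-use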
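The tok/s figures are easier to judge as waiting time. The sketch below converts the throughput ranges quoted for Tiers 1 to 4 into seconds per response. The 500-token response length is an assumption for illustration, and the calculation ignores prompt processing time, so real waits will be somewhat longer.

```python
# Throughput ranges (tokens per second) quoted in the tiers above.
TIER_TOK_PER_S = {
    "CX22": (2, 3),
    "CX32": (8, 12),
    "CX42": (5, 8),
    "Mac Mini M4": (35, 42),
}


def generation_seconds(tokens, tok_per_s):
    """Seconds needed to generate `tokens` output tokens at a steady rate."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return tokens / tok_per_s


def wait_range(tokens, tier):
    """Return (best, worst) wait in seconds using the tier's fast and slow rates."""
    slow, fast = TIER_TOK_PER_S[tier]
    return generation_seconds(tokens, fast), generation_seconds(tokens, slow)


# 500 tokens is an assumed response length, roughly a few paragraphs.
for tier in TIER_TOK_PER_S:
    best, worst = wait_range(500, tier)
    print(f"{tier}: {best:.1f} to {worst:.1f} s for 500 tokens")
```

At the CX22's 2–3 tok/s the same response takes several minutes, which is why that tier is described as suitable for learning rather than production.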

Context Length Matters

KV cache grows with context. A 7B model on an 8 GB server:

Context Tokens	RAM Used	Headroom (8 GB)
512	6.1 GB	1.9 GB ✓
2,048	6.7 GB	1.3 GB ✓
4,096	7.3 GB	0.7 GB ⚠️
8,192	8.0 GB	0 GB ✗

Rule: If you need 4k+ context, go up one tier.
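For context lengths between the rows of the table, a rough estimate can be had by linear interpolation. This is only a sketch: the data points are the four rows above for a 7B model, the helper names are hypothetical, and real KV cache size depends on the model architecture and runtime settings.

```python
from bisect import bisect_left

# (context tokens, RAM used in GB) for a 7B model, copied from the table above.
POINTS = [(512, 6.1), (2048, 6.7), (4096, 7.3), (8192, 8.0)]


def estimate_ram_gb(context_tokens):
    """Linearly interpolate RAM use between the table's rows."""
    xs = [x for x, _ in POINTS]
    if not xs[0] <= context_tokens <= xs[-1]:
        raise ValueError("context length is outside the table's range")
    i = bisect_left(xs, context_tokens)
    if xs[i] == context_tokens:
        return POINTS[i][1]
    (x0, y0), (x1, y1) = POINTS[i - 1], POINTS[i]
    return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)


def headroom_gb(context_tokens, total_ram_gb=8.0):
    """RAM left over on a server with total_ram_gb of memory."""
    return total_ram_gb - estimate_ram_gb(context_tokens)


for ctx in (1024, 3072, 6144):
    print(f"{ctx} tokens: ~{estimate_ram_gb(ctx):.2f} GB used, "
          f"~{headroom_gb(ctx):.2f} GB headroom")
```

A 3,072-token context, for example, lands halfway between the 2,048 and 4,096 rows, leaving about 1 GB free on the 8 GB server.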

Quantization Decision Tree

Do you have a GPU (8 GB VRAM+)?
├─ YES → use Q5 or FP16 if the model fits in VRAM (quality > size)
│         GPU runs these roughly 10× faster than CPU
│
└─ NO  → use Q4_K_M (the practical default on CPU)
          RAM headroom <2 GB? → try Q3_K
          RAM headroom ≥2 GB? → stick with Q4_K_M
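The same tree can be written as a small function. This is a sketch that mirrors the branches above; the function name and parameters are hypothetical, and a headroom of exactly 2 GB is assumed to fall on the Q4_K_M side.

```python
def choose_quantization(gpu_vram_gb, ram_headroom_gb):
    """Follow the decision tree above.

    gpu_vram_gb: VRAM of the GPU in GB, or 0 if there is no GPU.
    ram_headroom_gb: RAM expected to remain free after loading a Q4_K_M model.
    """
    if gpu_vram_gb >= 8:
        # GPU branch: favor quality over size.
        return "Q5 or FP16"
    if ram_headroom_gb < 2:
        # CPU branch, RAM-constrained: drop one quantization level.
        return "Q3_K"
    # CPU branch with enough headroom.
    return "Q4_K_M"


print(choose_quantization(gpu_vram_gb=0, ram_headroom_gb=1.0))
print(choose_quantization(gpu_vram_gb=0, ram_headroom_gb=3.0))
print(choose_quantization(gpu_vram_gb=24, ram_headroom_gb=0.0))
```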

Check RAM Before Committing

# On a Linux VPS, after loading a model:
free -h

# On a Mac, free is not available; use this instead:
top -l 1 | grep PhysMem

# Good (CX32 running 7B):
# Mem: 7.0Gi used, 1.0Gi free  ✓

# Too tight:
# Mem: 7.8Gi used, 0.2Gi free  ⚠️ (upgrade or use Q3)
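To automate the same check on Linux, the sketch below reads `MemAvailable` from `/proc/meminfo`, the kernel's estimate of memory that can be used without swapping. The 1.0 GiB threshold is an assumption taken from the "good" example above, and the function names are hypothetical. It does not work on macOS, which has no `/proc/meminfo`.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: kibibytes} dict."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def available_gib(info):
    """Convert the MemAvailable field from KiB to GiB."""
    return info["MemAvailable"] / (1024 ** 2)


def verdict(avail_gib, min_headroom_gib=1.0):
    """Return 'ok' if at least min_headroom_gib remains, else 'too tight'."""
    return "ok" if avail_gib >= min_headroom_gib else "too tight"


def check_host(path="/proc/meminfo"):
    """Read the live value on a Linux host and return (GiB available, verdict)."""
    with open(path) as f:
        avail = available_gib(parse_meminfo(f.read()))
    return avail, verdict(avail)
```

Call `check_host()` after the model has been loaded and has answered at least one prompt, so that the KV cache is included in the reading.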

Quick Shopping Guide

Just learning

Hetzner CX22 — €5.89/mo

Run 3B models, cancel anytime. Zero commitment.

Copilot replacement

Hetzner CX32 — €6.80/mo

7B Coder model, 8–12 tok/s, always-on.

Home always-on server

Mac Mini M4 — $599

35–42 tok/s via MLX. Breaks even against the CX32 (€6.80/mo) in about 7 years, ignoring electricity.

Need a GPU now

RunPod RTX 4090 — $0.44/hr

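Pay per use. A single 24 GB card fits models up to about 30B at Q4_K_M; 70B needs 48 GB+ per the table above, so rent a larger or multi-GPU instance for that.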
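The cost comparisons in this guide come down to two small calculations, sketched below. The prices are the ones quoted above; treating dollars and euros at parity, ignoring electricity, and the 2 hours per day of GPU use are all assumptions made for illustration.

```python
def break_even_months(one_time_cost, monthly_fee):
    """Months until a one-time purchase costs less than a recurring fee."""
    if monthly_fee <= 0:
        raise ValueError("monthly_fee must be positive")
    return one_time_cost / monthly_fee


def pay_per_use_monthly(hourly_rate, hours_per_day, days=30):
    """Monthly cost of an hourly-billed GPU used hours_per_day every day."""
    return hourly_rate * hours_per_day * days


# Mac Mini M4 ($599) against the Hetzner CX32 (6.80/mo), currencies at parity.
months = break_even_months(599, 6.80)
print(f"Mac Mini break-even: {months:.0f} months ({months / 12:.1f} years)")

# RunPod RTX 4090 at $0.44/hr, assuming 2 hours of use per day.
print(f"RunPod at 2 h/day: ${pay_per_use_monthly(0.44, 2):.2f}/month")
```

The first figure comes out at roughly 88 months, which is where the 7-year break-even comes from.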