LiteLLM + Ollama: Route 80% of Traffic Away From the API — clawdVPS

Most API calls are routine: summarise a doc, format JSON, generate boilerplate. A 7B open model handles these with no per-call fee. LiteLLM is a proxy that decides which backend to use: local Ollama for the easy stuff, Claude or GPT-4o only when the task genuinely needs it.

The Architecture

Your App
    ↓ http://localhost:8000/v1/chat/completions
LiteLLM Proxy  (decides routing)
    ├─→ Ollama  (80% of requests)  — €6.80/mo VPS
    └─→ Claude API  (20% complex)  — $0.003/call

When Ollama fails, LiteLLM retries the request on Claude automatically. When Claude hits rate limits, the request falls back to Ollama. Your app needs no fallback logic of its own.
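
The fallback behaviour described above amounts to trying backends in order until one answers. The sketch below illustrates that idea in plain Python; it is not LiteLLM's implementation, and `call_with_fallbacks`, `ollama_down` and `claude_ok` are hypothetical names used only for illustration.

```python
from typing import Callable, Sequence


def call_with_fallbacks(
    backends: Sequence[tuple[str, Callable[[str], str]]],
    prompt: str,
) -> tuple[str, str]:
    """Try each (name, fn) backend in order; return (name, reply) from the first success."""
    errors = []
    for name, fn in backends:
        try:
            return name, fn(prompt)
        except Exception as exc:  # a real proxy would only catch timeouts, 429s, 5xx, etc.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all backends failed: " + "; ".join(errors))


# Stand-ins for real backends (hypothetical, no network involved).
def ollama_down(prompt: str) -> str:
    raise ConnectionError("ollama unreachable")


def claude_ok(prompt: str) -> str:
    return f"claude reply to: {prompt}"


used, reply = call_with_fallbacks(
    [("ollama", ollama_down), ("claude", claude_ok)],
    "hi",
)
print(used, "->", reply)
```

Here the local backend fails, so the second entry in the list serves the request; reversing the list gives the "quality first, local fallback" order.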

Step 1: Install LiteLLM

# The [proxy] extra pulls in the server dependencies the proxy needs
pip install 'litellm[proxy]'

litellm --version
# LiteLLM 1.x.x

Step 2: Write litellm_config.yaml

Create litellm_config.yaml:

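model_list:
  # Concrete backends
  - model_name: "local-qwen-7b"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"

  - model_name: "claude-3.5-sonnet"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "sk-ant-..."

  - model_name: "gpt-4o"
    litellm_params:
      model: "gpt-4o"
      api_key: "sk-proj-..."

  # Task aliases: each one names the backend that is tried first
  - model_name: "default"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"

  - model_name: "summarize"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"

  - model_name: "reasoning"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "sk-ant-..."

  - model_name: "code"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"

litellm_settings:
  num_retries: 3
  request_timeout: 30
  # If the first backend for an alias fails, try these in order
  fallbacks:
    - default: ["claude-3.5-sonnet", "gpt-4o"]
    - reasoning: ["local-qwen-7b"]
    - code: ["claude-3.5-sonnet"]

# The proxy logs to its console output; redirect that to a file if you want to tail it later.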

Step 3: Start the Proxy

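# Proxy logs go to the terminal; tee also saves them to /tmp/litellm.log for Step 6
litellm --config litellm_config.yaml --port 8000 2>&1 | tee /tmp/litellm.log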

# Output:
# INFO: Started server process
# INFO: Uvicorn running on http://0.0.0.0:8000
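
Once the proxy is up, a quick smoke test confirms it answers on the OpenAI-style endpoint. The sketch below uses only the standard library; `build_chat_request` and `smoke_test` are hypothetical helper names, and the port and model name are taken from the config above.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, api_key: str = "any"
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request aimed at the proxy."""
    body = json.dumps(
        {"model": model, "messages": [{"role": "user", "content": prompt}]}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def smoke_test(
    base_url: str = "http://localhost:8000/v1", model: str = "local-qwen-7b"
) -> str:
    """Send one request and return the reply text; raises if the proxy is unreachable."""
    req = build_chat_request(base_url, model, "Reply with the word: ok")
    with urllib.request.urlopen(req, timeout=30) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With the proxy and Ollama both running, `print(smoke_test())` should print a short reply from the local model; a connection error here usually means the proxy is not listening on port 8000.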

Step 4: Update Your App

# Before (direct Claude)
from anthropic import Anthropic
client = Anthropic(api_key="sk-ant-...")
response = client.messages.create(model="claude-3-5-sonnet-20241022", ...)

# After (via LiteLLM: swap in the OpenAI-compatible client)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")
response = client.chat.completions.create(
    model="default",   # alias from litellm_config.yaml: local first, then fallbacks
    messages=[{"role": "user", "content": "Summarise this doc..."}],
)
print(response.choices[0].message.content)

Step 5: Route by Task Type

# Cheap task → always use local Ollama
response = client.chat.completions.create(
    model="summarize",
    messages=[{"role": "user", "content": "Summarise: ..."}],
)

# Complex reasoning → try Claude first
response = client.chat.completions.create(
    model="reasoning",
    messages=[{"role": "user", "content": "Design a multi-tenant SaaS..."}],
)
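
To keep alias names out of call sites, the task-to-model choice can live in one small helper. This is a minimal sketch, assuming the alias names from the config above; `TASK_TO_MODEL`, `model_for` and `complete` are hypothetical names, and `client` is the OpenAI-compatible client created in Step 4.

```python
# Map each task type to the LiteLLM alias that should serve it.
TASK_TO_MODEL = {
    "summarize": "summarize",  # local only
    "reasoning": "reasoning",  # Claude first, local fallback
    "code": "code",            # local first, Claude fallback
}


def model_for(task: str) -> str:
    """Return the alias for a task, or "default" for anything unrecognised."""
    return TASK_TO_MODEL.get(task, "default")


def complete(client, task: str, prompt: str) -> str:
    """Send the prompt through the proxy using the alias chosen for the task."""
    response = client.chat.completions.create(
        model=model_for(task),
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

A call such as `complete(client, "reasoning", "Design a multi-tenant SaaS...")` then reads the same as any other task, and changing the routing later only touches the dictionary.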

Step 6: Monitor Costs

tail -f /tmp/litellm.log

# Example output (illustrative; the exact format depends on your LiteLLM version and logging settings):
# Model: local-qwen-7b    | Tokens: 245 | Cost: $0      | ✓
# Model: claude-3.5-sonnet | Tokens: 512 | Cost: $0.003  | ✓

# Count Claude calls in the log:
grep -c "claude" /tmp/litellm.log

# Count all calls in the log:
grep -c "Model:" /tmp/litellm.log
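
For a per-model breakdown rather than raw counts, the same log can be summarised in a few lines of Python. This sketch assumes log lines in exactly the `Model: ... | Tokens: ... | Cost: $...` shape shown in the example output; if your log format differs, the regular expression needs adjusting. `LINE_RE` and `summarise_log` are hypothetical names.

```python
import re
from collections import defaultdict

# Matches lines shaped like: "Model: <name> | Tokens: <int> | Cost: $<amount>"
LINE_RE = re.compile(
    r"Model:\s*(?P<model>\S+)\s*\|\s*Tokens:\s*(?P<tokens>\d+)\s*\|\s*Cost:\s*\$(?P<cost>[\d.]+)"
)


def summarise_log(lines):
    """Return {model: {"calls", "tokens", "cost"}} for every matching log line."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost": 0.0})
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that are not per-request records
        entry = totals[match["model"]]
        entry["calls"] += 1
        entry["tokens"] += int(match["tokens"])
        entry["cost"] += float(match["cost"])
    return dict(totals)


sample = [
    "Model: local-qwen-7b    | Tokens: 245 | Cost: $0      | ✓",
    "Model: claude-3.5-sonnet | Tokens: 512 | Cost: $0.003  | ✓",
]
report = summarise_log(sample)
for model, stats in report.items():
    print(f"{model}: {stats['calls']} calls, {stats['tokens']} tokens, ${stats['cost']:.3f}")
```

To run it against the real file, pass `open("/tmp/litellm.log", encoding="utf-8")` instead of `sample`.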

Real Economics

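Scenario	Monthly cost
1,800 calls/mo, all Claude 3.5 Sonnet	~$5.40
1,800 calls/mo, all GPT-4o	~$11.70
1,800 calls/mo, LiteLLM 80/20 split + CX32 VPS	~$7.72
10 products, all GPT-4o	~$117
10 products, LiteLLM 80/20 split	~$77 (saves ~$40/mo)

At a single product's 1,800 calls a month, the split beats all-GPT-4o but not all-Claude, because the fixed VPS cost outweighs the API savings at low volume. The savings grow with usage: across 10 products the split saves about $40/month against all-GPT-4o, or $480/year, and sharing one CX32 between products lowers the fixed cost further.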
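
The arithmetic behind these figures is easy to rerun with your own numbers. The sketch below is a rough model, assuming a flat per-call API price ($0.003 for Claude, as in the architecture diagram) and a VPS price of 6.80 treated as if it were in the same currency as the API price, which is a simplification, so its totals differ slightly from the table. The function names are hypothetical.

```python
import math


def all_api_cost(calls: int, price_per_call: float) -> float:
    """Monthly cost when every call goes to the paid API."""
    return calls * price_per_call


def hybrid_cost(calls: int, api_share: float, price_per_call: float, vps_cost: float) -> float:
    """Monthly cost when only `api_share` of calls hit the API and the rest run on the VPS."""
    return vps_cost + calls * api_share * price_per_call


def break_even_calls(api_share: float, price_per_call: float, vps_cost: float) -> int:
    """Smallest monthly call count at which the hybrid setup is strictly cheaper."""
    saved_per_call = (1 - api_share) * price_per_call
    return math.floor(vps_cost / saved_per_call) + 1


CLAUDE_PER_CALL = 0.003  # from the article
VPS_PER_MONTH = 6.80     # assumption: euro price used as-is

print(f"all Claude, 1,800 calls: {all_api_cost(1800, CLAUDE_PER_CALL):.2f}")
print(f"80/20 split, 1,800 calls: {hybrid_cost(1800, 0.2, CLAUDE_PER_CALL, VPS_PER_MONTH):.2f}")
print(f"break-even calls/month: {break_even_calls(0.2, CLAUDE_PER_CALL, VPS_PER_MONTH)}")
```

Under these assumptions the split starts paying for itself against all-Claude at roughly 2,800 calls a month; a pricier API model or a shared VPS moves that point lower.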

Add Auth for Production

# In litellm_config.yaml
general_settings:
  master_key: "sk-your-secret-key"

# In your app:
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-your-secret-key")

Common Patterns

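# Always local unless it fails ("default" points at local-qwen-7b); goes under litellm_settings
fallbacks: [{"default": ["claude-3.5-sonnet"]}]

# Quality first, local fallback ("reasoning" points at Claude); goes under litellm_settings
fallbacks: [{"reasoning": ["local-qwen-7b"]}]

# Prefer local 2:1: fallbacks run in order, so use two weighted deployments under one model_name
model_list:
  - model_name: "mixed"
    litellm_params: {model: "ollama/qwen2.5:7b-instruct-q4_k_m", api_base: "http://localhost:11434", weight: 2}
  - model_name: "mixed"
    litellm_params: {model: "claude-3-5-sonnet-20241022", api_key: "sk-ant-...", weight: 1}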