Most API calls are routine: summarise a doc, format JSON, generate boilerplate. A 7B open model handles most of these with no per-call fee. LiteLLM is a proxy that sits between your app and the model backends: local Ollama for the easy stuff, Claude or GPT-4o only when the task genuinely needs it.
The Architecture
Your App
↓ http://localhost:8000/v1/chat/completions
LiteLLM Proxy (decides routing)
├─→ Ollama (80% of requests) — €6.80/mo VPS
└─→ Claude API (20% complex) — $0.003/call
When Ollama fails, LiteLLM retries the request on Claude automatically. When Claude hits rate limits, the request falls back to Ollama. Your app code does not change.
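The fallback behaviour described above can be pictured with a few lines of plain Python. This is a conceptual sketch only, not LiteLLM's implementation: `BackendError`, `call_with_fallbacks`, `flaky_local` and `cloud` are hypothetical names, and the backends are stand-in functions rather than real API calls.

```python
from typing import Callable, Optional, Sequence, Tuple


class BackendError(Exception):
    """Raised by a backend that is down or rate-limited (hypothetical)."""


def call_with_fallbacks(
    prompt: str,
    backends: Sequence[Tuple[str, Callable[[str], str]]],
    retries_per_backend: int = 1,
) -> Tuple[str, str]:
    """Try each backend in order; return (backend_name, reply) from the first success."""
    last_error: Optional[Exception] = None
    for name, backend in backends:
        for _ in range(retries_per_backend):
            try:
                return name, backend(prompt)
            except BackendError as exc:
                last_error = exc  # remember the failure, then try again or move on
    raise RuntimeError("all backends failed") from last_error


def flaky_local(prompt: str) -> str:
    raise BackendError("ollama unreachable")  # simulate a local outage


def cloud(prompt: str) -> str:
    return f"cloud reply to: {prompt}"


if __name__ == "__main__":
    # The local backend fails, so the call lands on the second entry.
    print(call_with_fallbacks("hello", [("ollama", flaky_local), ("claude", cloud)]))
```

The order of the list is the whole policy: put the cheap backend first for cost, or the strong backend first for quality.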
Step 1: Install LiteLLM
pip install 'litellm[proxy]'
litellm --version
# LiteLLM 1.x.x
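Step 2: Write litellm_config.yaml
Create litellm_config.yaml:
model_list:
  # Concrete backends
  - model_name: "local-qwen-7b"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"
  - model_name: "claude-3.5-sonnet"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "sk-ant-..."
  - model_name: "gpt-4o"
    litellm_params:
      model: "gpt-4o"
      api_key: "sk-proj-..."
  # Task aliases: each one points at its primary backend
  - model_name: "default"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"
  - model_name: "summarize"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"
  - model_name: "reasoning"
    litellm_params:
      model: "claude-3-5-sonnet-20241022"
      api_key: "sk-ant-..."
  - model_name: "code"
    litellm_params:
      model: "ollama/qwen2.5:7b-instruct-q4_k_m"
      api_base: "http://localhost:11434"
litellm_settings:
  # Ordered fallbacks per alias, tried once the primary has used up its retries
  fallbacks:
    - default: ["claude-3.5-sonnet", "gpt-4o"]
    - reasoning: ["local-qwen-7b"]
    - code: ["claude-3.5-sonnet"]
  num_retries: 3
  request_timeout: 30
  log_file: /tmp/litellm.log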
Step 3: Start the Proxy
litellm --config litellm_config.yaml --port 8000
# Output:
# INFO: Started server process
# INFO: Uvicorn running on http://0.0.0.0:8000
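Before pointing an app at the proxy, it can help to confirm that something is actually listening on the port. The following is a minimal standard-library sketch; `wait_for_port` is a hypothetical helper, and it only checks that a TCP connection succeeds, not that the proxy is healthy.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.2)  # nothing listening yet; retry shortly
    return False


if __name__ == "__main__":
    # Port 8000 matches the --port flag used when starting the proxy.
    print("proxy reachable:", wait_for_port("localhost", 8000, timeout=2.0))
```

A check like this is useful in deploy scripts that start the proxy and the app together.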
Step 4: Update Your App
# Before (direct Claude)
from anthropic import Anthropic
client = Anthropic(api_key="sk-ant-...")
response = client.messages.create(model="claude-3-5-sonnet-20241022", ...)

# After (via LiteLLM, using the OpenAI-compatible client)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="any")
response = client.chat.completions.create(
    model="default",  # alias from litellm_config.yaml: local model first, cloud on failure
    messages=[{"role": "user", "content": "Summarise this doc..."}],
)
print(response.choices[0].message.content)
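Because the proxy speaks the OpenAI chat-completions format, the same call can be made without any SDK. The sketch below builds the raw HTTP request with the standard library so the payload shape is visible; `build_chat_request` is a hypothetical helper, and it deliberately does not send anything.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, prompt: str, api_key: str = "any"
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    req = build_chat_request("http://localhost:8000/v1", "default", "Summarise this doc...")
    print(req.full_url)
    print(req.data.decode("utf-8"))
    # With the proxy running, urllib.request.urlopen(req) would send it.
```

Any language with an HTTP client can talk to the proxy the same way.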
Step 5: Route by Task Type
# Cheap task → always use local Ollama
response = client.chat.completions.create(
    model="summarize",
    messages=[{"role": "user", "content": "Summarise: ..."}],
)

# Complex reasoning → try Claude first
response = client.chat.completions.create(
    model="reasoning",
    messages=[{"role": "user", "content": "Design a multi-tenant SaaS..."}],
)
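Note that the app chooses the alias; the proxy only resolves it to a backend. One way to keep that choice in a single place is a small lookup table. This is a sketch: the alias names come from the requests above, while the task labels and the `pick_alias` helper are hypothetical.

```python
# Aliases match the model names used in the requests above;
# the task labels on the left are made up for illustration.
TASK_TO_ALIAS = {
    "summary": "summarize",
    "format": "summarize",
    "design": "reasoning",
    "analysis": "reasoning",
    "codegen": "code",
}


def pick_alias(task: str) -> str:
    """Map an app-level task label to a proxy alias, falling back to 'default'."""
    return TASK_TO_ALIAS.get(task, "default")


if __name__ == "__main__":
    for task in ("summary", "design", "codegen", "something-else"):
        print(task, "->", pick_alias(task))
```

The result can be passed straight to `model=` in `client.chat.completions.create(...)`.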
Step 6: Monitor Costs
tail -f /tmp/litellm.log
# Example output (exact format depends on your LiteLLM version and logging setup):
# Model: local-qwen-7b | Tokens: 245 | Cost: $0 | ✓
# Model: claude-3.5-sonnet | Tokens: 512 | Cost: $0.003 | ✓
# Count Claude calls in the log:
grep -c "Model: claude" /tmp/litellm.log
# Count all calls:
grep -c "Model:" /tmp/litellm.log
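For a per-model breakdown rather than raw counts, a short script can tally the log. This sketch assumes the lines look exactly like the example output above (`Model: … | Tokens: … | Cost: $…`); if your log format differs, the regular expression needs adjusting. `summarise_log` is a hypothetical helper.

```python
import re
from collections import defaultdict

# Matches lines such as: Model: claude-3.5-sonnet | Tokens: 512 | Cost: $0.003 | ✓
LINE_RE = re.compile(
    r"Model:\s*(?P<model>\S+)\s*\|\s*Tokens:\s*(?P<tokens>\d+)\s*\|\s*Cost:\s*\$(?P<cost>[\d.]+)"
)


def summarise_log(lines):
    """Return {model: {"calls": int, "tokens": int, "cost": float}} for matching lines."""
    totals = defaultdict(lambda: {"calls": 0, "tokens": 0, "cost": 0.0})
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines in any other format
        entry = totals[match["model"]]
        entry["calls"] += 1
        entry["tokens"] += int(match["tokens"])
        entry["cost"] += float(match["cost"])
    return dict(totals)


if __name__ == "__main__":
    sample = [
        "Model: local-qwen-7b | Tokens: 245 | Cost: $0 | ✓",
        "Model: claude-3.5-sonnet | Tokens: 512 | Cost: $0.003 | ✓",
    ]
    for model, stats in summarise_log(sample).items():
        print(model, stats)
```

To run it against the real file, pass `open("/tmp/litellm.log", encoding="utf-8")` instead of the sample list.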
Real Economics
| Scenario | Monthly Cost |
|---|---|
| 1,800 calls/mo all Claude 3.5 | ~$5.40 |
| 1,800 calls/mo all GPT-4o | ~$11.70 |
| LiteLLM 80/20 split + CX32 VPS | ~$7.72 |
| LiteLLM 80/20, 10 products | ~$77/mo vs $117 → save $40/mo |
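The savings grow with volume, because the VPS is a fixed cost and only the Claude share scales with calls. Read the table with its baselines in mind: $117 is ten products on all GPT-4o (10 × $11.70), and $77 is ten times the $7.72 split figure, so it counts one VPS per product. Against all Claude 3.5 at 1,800 calls a month ($5.40), the split is actually more expensive ($7.72) until call volume rises or several products share one CX32. At the table's figures, 10 products save about $40/month, or $480/year.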
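The arithmetic behind the table can be reproduced in a few lines, which also makes it easy to plug in your own volumes. This is a sketch under stated assumptions: the $0.003 per-call Claude price comes from the architecture diagram, and the $6.64 VPS figure is inferred from the table ($7.72 minus 360 Claude calls at $0.003), not quoted directly. Both function names are hypothetical.

```python
def monthly_cost(calls: int, cloud_share: float, cloud_price: float, vps_cost: float) -> float:
    """Monthly cost when `cloud_share` of calls hit a paid API and the rest run on a flat-rate VPS."""
    return calls * cloud_share * cloud_price + vps_cost


def break_even_calls(cloud_share: float, cloud_price: float, vps_cost: float) -> float:
    """Calls per month above which the split beats sending every call to the paid API.

    Split cost N*s*p + V equals all-cloud cost N*p when N = V / (p * (1 - s)).
    """
    return vps_cost / (cloud_price * (1.0 - cloud_share))


if __name__ == "__main__":
    CLAUDE_PRICE = 0.003  # $/call, from the architecture diagram
    VPS = 6.64            # assumption: inferred from the table's $7.72 row
    print(f"all Claude:  ${monthly_cost(1800, 1.0, CLAUDE_PRICE, 0.0):.2f}")
    print(f"80/20 split: ${monthly_cost(1800, 0.2, CLAUDE_PRICE, VPS):.2f}")
    print(f"break-even:  {break_even_calls(0.2, CLAUDE_PRICE, VPS):.0f} calls/month")
```

Under these assumptions the 80/20 split only beats all-Claude above roughly 2,767 calls a month on a single VPS, which is why sharing one box across products matters.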
Add Auth for Production
# In litellm_config.yaml
general_settings:
  master_key: "sk-your-secret-key"  # LiteLLM expects this key to start with "sk-"
# In your app:
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-your-secret-key")
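A placeholder string is fine for a demo, but a production key should be random. One way to generate one with the standard library is sketched below; `new_proxy_key` is a hypothetical helper, and the `sk-` prefix is an assumption about the key format your proxy expects.

```python
import secrets


def new_proxy_key(prefix: str = "sk-", nbytes: int = 32) -> str:
    """Return a random URL-safe key; the default 'sk-' prefix is an assumption."""
    return prefix + secrets.token_urlsafe(nbytes)


if __name__ == "__main__":
    # Print once, store it in your secrets manager, and reference it from the config and the app.
    print(new_proxy_key())
```

Keeping the key in an environment variable rather than in source control avoids leaking it through the repository.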
Common Patterns
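# Always local unless it fails (key = alias whose primary is the local model, value = ordered fallbacks)
fallbacks:
  - default: ["claude-3.5-sonnet"]
# Quality first, local fallback (the "reasoning" alias points at Claude)
fallbacks:
  - reasoning: ["local-qwen-7b"]
# Extra local attempt before Claude: fallbacks run in order, so this is a retry, not a 2:1 weighted split
fallbacks:
  - default: ["local-qwen-7b", "claude-3.5-sonnet"]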
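Fallback lists are ordered, so they cannot express a proportional split on their own. If a true 2:1 preference is wanted, one option is to pick the model name in the app before calling the proxy. The sketch below does that with weighted random choice; `pick_weighted` is a hypothetical helper, and this is a client-side technique rather than a LiteLLM feature.

```python
import random
from collections import Counter
from typing import Dict


def pick_weighted(weights: Dict[str, int], rng: random.Random) -> str:
    """Pick one model name with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # seeded so repeated runs give the same sequence
    weights = {"local-qwen-7b": 2, "claude-3.5-sonnet": 1}
    counts = Counter(pick_weighted(weights, rng) for _ in range(3000))
    print(counts)  # roughly two local picks for every Claude pick
```

The chosen name goes into `model=` as usual, and the proxy's fallbacks still apply if that backend is down.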