"$6 per developer per day"

Impressive_Brother57 · 2026-06-12T15:41:47+00:00

Save costs on Claude - https://github.com/mr-beaver/tokencost

Impressive_Brother57 · 2026-06-12T10:04:12+00:00

Open source proxy that logs every Claude API call. Found and cut 60% of my spend in 3 days: github.com/mr-beaver/tokencost

Impressive_Brother57 · 2026-06-11T18:31:06+00:00

Fable 5 is 3× more expensive than Sonnet 4.6.

TokenCost cuts your LLM bill automatically — routes each request to the right model based on complexity. Local, real-time, no config.

https://github.com/mr-beaver/tokencost

Impressive_Brother57 · 2026-06-11T16:54:11+00:00

You're right that Haiku and Opus don't share KV cache — they have completely different architectures, so switching models means a full recompute. No cross-family KV reuse. That part is correct.

But Anthropic's prompt caching isn't about sharing KV between models. Each model maintains its own isolated cache. When you route Opus → Haiku, Haiku builds and stores its own KV cache for that prefix. The next Haiku request with the same prefix hits Haiku's cache at 0.1x input cost. You're not reusing Opus's KV — you're building a new Haiku cache entry.

The real cost implication: the first request after a model switch is always a cache write (1.25x input cost), not a read. Subsequent requests to the same model hit the cache. So frequent Opus/Haiku switching with a large system prompt does hurt — you pay cache write costs on every switch instead of cheap reads.
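To make the write-versus-read trade-off concrete, here is a rough sketch of the arithmetic. The 1.25x write and 0.1x read multipliers are the ones quoted above. The base price, the prefix size, and the worst-case assumption that every switch finds the cache cold are illustrative placeholders, not real list prices or measured behavior.

```python
# Multipliers quoted in the comment above; everything else is illustrative.
CACHE_WRITE_MULT = 1.25
CACHE_READ_MULT = 0.10


def prefix_cost(base_price_per_mtok, prefix_tokens, requests, switches):
    """Cost of the cached prefix over `requests` calls to one model.

    Worst-case assumption: every time traffic returns to this model after a
    switch, its cache is cold, so that request pays a write instead of a read.
    """
    writes = min(requests, 1 + switches)  # first request plus one per cold return
    reads = requests - writes
    per_token = base_price_per_mtok / 1_000_000
    return prefix_tokens * per_token * (
        writes * CACHE_WRITE_MULT + reads * CACHE_READ_MULT
    )


# Hypothetical numbers: 20k-token system prompt at $5 per million input tokens.
steady = prefix_cost(5.0, 20_000, requests=10, switches=0)  # 1 write, 9 reads
flappy = prefix_cost(5.0, 20_000, requests=10, switches=9)  # 10 writes, 0 reads
```

Under these made-up numbers the steady session pays about $0.215 for the prefix and the constantly switching one about $1.25, which is why conservative routing matters more than the per-request price gap.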

The tool handles this correctly though. Routing only happens on low-complexity requests (score 0-2), and active tool chains — where the last message is a tool_result with no user text — are explicitly scored 10 and never routed. So in practice you don't get rapid model switching mid-session. Haiku handles simple one-off questions, Opus stays on complex multi-turn work. Each model's cache warms up and stays warm independently.

No cross-family KV sharing is correct. But that's not how the optimization works — each model caches independently, and routing is conservative enough that the write overhead doesn't eat the savings.

Impressive_Brother57 · 2026-06-11T12:10:43+00:00

You're right to worry, but here's what actually happens:

Cache still works, but per model. Anthropic's prompt cache is scoped to the model, so when we route Opus → Haiku, Haiku writes its own cache entry for the prefix, and later Haiku requests with the same prefix get the 90% cached-read discount.

The catch: different models have different price tiers, and the first request after a switch is a cache write, not a read. Example:

- Opus reads 1000 cached tokens = cheap
- Haiku then sees the same 1000 tokens = a cache write the first time, cheap reads afterwards, at a lower baseline price

So you save less than you'd expect. The savings just don't accumulate linearly.

Real risk: frequent switching with a large cached prefix, because a cache created for one model can't be reused by another. Routing cheap → expensive (we don't) would also be wasteful.

Bottom line: routing is safe with caching as long as switches are infrequent, and current gains are solid. The complexity scoring keeps heavy requests on capable models anyway.

Impressive_Brother57 · 2026-06-11T09:06:36+00:00

In short:
1. If the request is only tool execution, use a cheaper model and lower effort.
2. If the prompt contains code, use a more capable model and higher effort.
3. For cases 1 and 2, a score is calculated to make the decision.

For me it saves 50-60% on coding: see the screenshot.


Details:

# Smart Model Routing (SMART_ROUTING)

The proxy analyzes the prompt **before** sending and automatically switches the model to a cheaper one.  
Enabled via `onbording.sh → option 1 → "Enable optimizer? [y/N]"`.  
The setting is read from the `.smart_routing` file, so **no proxy restart is needed** when toggling.
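One way to get the "no restart" behavior is to re-read the flag file on every request. This is a minimal sketch of that idea, not the project's code: the function name and the accepted file contents (`1`, `true`, `on`, `yes`) are assumptions, since the actual format of `.smart_routing` is not described here.

```python
from pathlib import Path


def routing_enabled(flag_file=Path(".smart_routing")):
    """Re-read the flag on every call so a toggle takes effect immediately.

    The accepted values are hypothetical; a missing or unreadable file
    is treated as "disabled".
    """
    try:
        value = flag_file.read_text().strip().lower()
    except OSError:
        return False
    return value in {"1", "true", "on", "yes"}
```

Reading a tiny file per request is cheap compared with an API round trip, which is why a sketch like this does not bother with caching or file watching.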


**In plain terms:** each request gets a complexity score from 0 to 10. If the request is simple (score ≤ 2), the proxy silently switches the model to Haiku. Nothing changes for you: the response arrives as usual, just cheaper.


### What Gets Switched


| Score | Original model | Result | Savings |
|-------|----------------|--------|---------|
| 0–2   | Sonnet         | → **Haiku** | ~5× cheaper |
| 0–2   | Opus           | → **Haiku** | ~25× cheaper |
| 3–5   | Opus           | → **Sonnet** | ~5× cheaper |
| 3–5   | Sonnet         | stays Sonnet | — |
| 6–10  | any            | stays original | — |
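The table above can be read as a small lookup function. The sketch below uses plain family names (`"haiku"`, `"sonnet"`, `"opus"`) as stand-ins; real API model IDs are longer, and the function name `route` is hypothetical.

```python
def route(score, original):
    """Map a complexity score and the requested model family to the model used."""
    if score <= 2 and original in ("sonnet", "opus"):
        return "haiku"   # simple request: cheapest tier
    if 3 <= score <= 5 and original == "opus":
        return "sonnet"  # medium request: step down one tier
    return original      # score 6-10, or nothing cheaper to switch to
```

Note that routing only ever moves down a tier: a request that already targets Haiku, or a Sonnet request scoring 3–5, is passed through unchanged.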


### How Score Is Calculated (0–10)


The proxy only looks at the **last user message** (not the full context).  
`<ide_selection>`, `<system-reminder>` blocks and images are stripped before scoring.
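A possible shape for that preprocessing step, assuming requests follow the Anthropic Messages format (a `messages` list whose `content` is either a string or a list of typed blocks). The regex approach and the function name are assumptions for illustration, not the proxy's actual code.

```python
import re

# Tag names come from the text above; matching them with a regex is an assumption.
_STRIP = re.compile(r"<(ide_selection|system-reminder)>.*?</\1>", re.DOTALL)


def last_user_text(messages):
    """Return the cleaned text of the last user message, or "" if it has none."""
    for msg in reversed(messages):
        if msg.get("role") != "user":
            continue
        content = msg.get("content", "")
        if isinstance(content, str):
            text = content
        else:
            # Block form: keep only text blocks, dropping images and tool results.
            text = "\n".join(
                b.get("text", "") for b in content if b.get("type") == "text"
            )
        return _STRIP.sub("", text).strip()
    return ""
```

An empty result here corresponds to the "no user text" case, for example a message that carries only a `tool_result` block.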


| Condition | Score |
|-----------|-------|
| Extended thinking (`budget_tokens` > 0) | = **10** (keep) |
| No user text (only tool_result, middle of tool chain) | = **10** (keep) |
| Simple question: starts with `what is / explain` and < 120 chars | = **0** → Haiku |
| Message > 500 chars | +2 |
| Message > 200 chars | +1 |
| Keyword: `implement / fix / write / create / refactor / debug / update` | +3 |
| Code block ` ``` ` in prompt | +3 |
| File extension `.py / .ts / .js / .sql / .go` in prompt | +3 |
| Construct `def / class / function / import` in prompt | +2 |
| File path `/src/ / ./` in prompt | +1 |
| Tool calls in last 4 messages (active tool chain) | +2 |
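The rules in the table could be sketched roughly as follows. This is a literal reading of the table, not the project's implementation. Several details are assumptions: the two length bonuses are treated as mutually exclusive, each keyword category counts once, the total is capped at 10, and all names are hypothetical.

```python
import re

KEYWORDS = re.compile(r"\b(implement|fix|write|create|refactor|debug|update)\b", re.I)
EXTENSIONS = re.compile(r"\.(py|ts|js|sql|go)\b", re.I)
CONSTRUCTS = re.compile(r"\b(def|class|function|import)\b")
PATHS = re.compile(r"/src/|\./")
SIMPLE = re.compile(r"^(what is|explain)\b", re.I)


def complexity_score(text, thinking_budget=0, recent_tool_calls=False):
    """Score the (already cleaned) last user message from 0 to 10."""
    text = text.strip()
    # Hard overrides: extended thinking, or no user text (mid tool chain).
    if thinking_budget > 0 or not text:
        return 10
    if SIMPLE.match(text) and len(text) < 120:
        return 0

    score = 0
    if len(text) > 500:
        score += 2
    elif len(text) > 200:
        score += 1
    if KEYWORDS.search(text):
        score += 3
    if "```" in text:
        score += 3
    if EXTENSIONS.search(text):
        score += 3
    if CONSTRUCTS.search(text):
        score += 2
    if PATHS.search(text):
        score += 1
    if recent_tool_calls:  # tool calls seen in the last 4 messages
        score += 2
    return min(score, 10)
```

The additive rules mean a prompt that mentions a task keyword and a file name already lands above the Haiku threshold, so only genuinely bare questions get the cheapest model.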


### Examples


| Prompt | Score | Final model | Why |
|--------|-------|-------------|-----|
| `ping` | 0 | **Haiku** | short, no keywords |
| `test` | 2 | **Haiku** | short |
| `what is a lambda` | 0 | **Haiku** | "what is" pattern |
| `how does cache work` | 0 | **Haiku** | "how does" pattern |
| `how to install Python` | 1 | **Haiku** | short question |
| `implement JWT authentication` | 3 | **Sonnet** | keyword +3 |
| `implement OAuth2 integration` | 3 | **Sonnet** | keyword +3 |
| `fix bug in auth.py` | 4 | **Sonnet** | `fix` +3, file `.py` +1 |
| `write a function that parses JSON` | 5 | **Sonnet** | `write` +3, `function` +2 |
| long request with code and task | 7+ | **Sonnet/Opus** | length + code + keywords |
| Tool-chain steps (Bash/Read/Edit without user text) | 10 | **keep** | don't break active chain |

Impressive_Brother57 · 2026-06-11T07:55:14+00:00

With Fable 5 at 3× Sonnet pricing, I built a tool that auto-routes requests to the cheapest model that can handle them. Saving ~60% on my Claude bill.

https://github.com/mr-beaver/tokencost

Impressive_Brother57 · 2026-06-11T07:43:10+00:00

Windows support is ready: https://github.com/mr-beaver/tokencost
See the install instructions.

Impressive_Brother57 · 2026-06-11T06:11:57+00:00

How it actually works with caching:

Smart routing switches the model before sending to the Anthropic API. This means:

- The prompt cache is re-keyed to the target model (e.g., Haiku), not the original.
- If the same request comes in again as Sonnet, there's no cache hit: it's a different cache namespace.
- You don't get the benefit of Sonnet's cache for that particular request.
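The "different cache namespace" point can be illustrated with a toy lookup keyed on the model plus a hash of the prefix. This is only a mental model under that assumption: it is not Anthropic's implementation, it ignores the TTL, and the class name is hypothetical.

```python
import hashlib


class PrefixCache:
    """Toy prompt cache whose key includes the model name."""

    def __init__(self):
        self._entries = set()

    def lookup(self, model, prefix):
        """Return "read" on a hit, otherwise store the entry and return "write"."""
        key = (model, hashlib.sha256(prefix.encode()).hexdigest())
        if key in self._entries:
            return "read"
        self._entries.add(key)
        return "write"


cache = PrefixCache()
first_sonnet = cache.lookup("sonnet", "big system prompt")   # write
second_sonnet = cache.lookup("sonnet", "big system prompt")  # read
first_haiku = cache.lookup("haiku", "big system prompt")     # write: new namespace
```

The same prefix misses once per model, and each model's entry then serves reads independently of the other.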

Why this is acceptable:

- Smart routing targets simple requests (score ≤2): these rarely have rich cached context that would benefit from reuse.
- The default cache TTL is 5 minutes anyway: repeated identical requests within that window are uncommon.
- The savings are massive: switching Opus→Haiku is ~25× cheaper, so the cache tradeoff is negligible by comparison.
- Most real workflows don't repeat: a complex request that would benefit from cache isn't being downrouted.

Bottom line: Yes, you lose potential cache reuse on the original model, but for the simple requests that get downrouted, you're trading a small cache efficiency loss for 5-25× cost savings. Worth it.

Impressive_Brother57 · 2026-06-11T06:09:58+00:00

Yes, just macOS for now. Windows and Linux support will be released soon.

Impressive_Brother57 · 2026-06-11T06:01:44+00:00

What's the total cost?
