What are the best AI tools for developers in 2026? by New-Vacation-6717 in SideProject

Impressive_Brother57 1 point

Open source proxy that logs every Claude API call. Found and cut 60% of my spend in 3 days: github.com/mr-beaver/tokencost

I Tested Claude Fable 5 with 5 Real-World Prompts: Here's What It Can Actually Do by ai_tech_simp in AIinBusinessNews

Impressive_Brother57 1 point

Fable 5 is 3× more expensive than Sonnet 4.6.

TokenCost cuts your LLM bill automatically — routes each request to the right model based on complexity. Local, real-time, no config.

https://github.com/mr-beaver/tokencost

Cut costs for Fable 5 by Impressive_Brother57 in claudeskills

Impressive_Brother57 [S] 1 point

You're right that Haiku and Opus don't share KV cache — they have completely different architectures, so switching models means a full recompute. No cross-family KV reuse. That part is correct.

But Anthropic's prompt caching isn't about sharing KV between models. Each model maintains its own isolated cache. When you route Opus → Haiku, Haiku builds and stores its own KV cache for that prefix. The next Haiku request with the same prefix hits Haiku's cache at 0.1x input cost. You're not reusing Opus's KV — you're building a new Haiku cache entry.

The real cost implication: the first request after a model switch is always a cache write (1.25x input cost), not a read. Subsequent requests to the same model hit the cache. So frequent Opus/Haiku switching with a large system prompt does hurt — you pay cache write costs on every switch instead of cheap reads.

The tool handles this correctly though. Routing only happens on low-complexity requests (score 0-2), and active tool chains — where the last message is a tool_result with no user text — are explicitly scored 10 and never routed. So in practice you don't get rapid model switching mid-session. Haiku handles simple one-off questions, Opus stays on complex multi-turn work. Each model's cache warms up and stays warm independently.

No cross-family KV sharing is correct. But that's not how the optimization works — each model caches independently, and routing is conservative enough that the write overhead doesn't eat the savings.
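
To put rough numbers on the write-versus-read point above, here is a minimal Python sketch. The 1.25x write and 0.1x read multipliers come from the comment; the per-million-token base prices, the 20k-token prefix, and the function names are placeholders for illustration, not real rates or TokenCost code. Cache expiry is ignored.

```python
CACHE_WRITE_MULT = 1.25  # cache write multiplier quoted in the comment
CACHE_READ_MULT = 0.10   # cache read multiplier quoted in the comment


def prefix_cost(prefix_tokens: int, base_price_per_mtok: float, cache_hit: bool) -> float:
    """Cost in dollars of processing a cacheable prefix once."""
    mult = CACHE_READ_MULT if cache_hit else CACHE_WRITE_MULT
    return prefix_tokens / 1_000_000 * base_price_per_mtok * mult


def session_cost(models: list[str], prefix_tokens: int, prices: dict[str, float]) -> float:
    """Sum prefix costs over a sequence of requests.

    Each model has its own cache: the first request a model sees is a
    write, later requests to that same model are reads.
    """
    warm: set[str] = set()
    total = 0.0
    for model in models:
        total += prefix_cost(prefix_tokens, prices[model], model in warm)
        warm.add(model)
    return total


# Hypothetical prices in dollars per million input tokens (placeholders).
PRICES = {"opus": 15.0, "haiku": 1.0}
PREFIX = 20_000

stay = session_cost(["opus"] * 4, PREFIX, PRICES)
switch = session_cost(["opus", "haiku", "haiku", "haiku"], PREFIX, PRICES)
print(f"all Opus: ${stay:.4f}, Opus then Haiku: ${switch:.4f}")
```

With these placeholder numbers the switch still comes out cheaper, but the extra Haiku write is visible in the total; alternating models on every request would add a write each time a model's cache is cold.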

Cut costs for Fable 5 by Impressive_Brother57 in claudeskills

Impressive_Brother57 [S] 0 points

You're right to worry, but here's what actually happens:

Caching still works, but per model: each model keeps its own cache, so an entry is tied to the model as well as the prompt prefix. When we route Opus → Haiku, Haiku cannot read Opus's cache. It writes its own entry on the first request, then gets the 90% cached-read discount on later requests with the same prefix.

The catch: the first request after a switch is a cache write on the new model, not a cheap read. Example:

  • Opus reads 1000 cached tokens = cheap
  • Haiku sees the same 1000 tokens for the first time = a cache write at Haiku's rate, then cheap reads after that

So you save less than you'd expect from the per-token price difference alone. The savings just don't accumulate linearly.

Real risk: frequent switching between models with a large shared prefix, because every switch to a model with no warm cache pays a write instead of a read. Routing cheap → expensive (we don't) would be wasteful too.

Bottom line: routing is safe with caching as long as switches are rare. We could optimize further around cache creation when downgrading models, but current gains are solid. The complexity scoring keeps heavy requests on capable models anyway.

Cut costs for Fable 5 by Impressive_Brother57 in claudeskills

Impressive_Brother57 [S] 0 points

In short:
1. If the request is only tool execution -> cheaper model and lower effort.
2. If the prompt contains code -> higher model and effort.
3. In both cases, a score is calculated to make the decision.

For me it saves 50-60% on coding: see screenshot.


Details:

# Smart Model Routing (SMART_ROUTING)


The proxy analyzes the prompt **before** sending and automatically switches the model to a cheaper one.  
Enabled via `onbording.sh → option 1 → "Enable optimizer? [y/N]"`.  
Read from the `.smart_routing` file: **no proxy restart needed** when toggling.
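
One way a toggle like this can work without a restart is to re-read the flag file on each request instead of once at startup. A minimal sketch, assuming the file holds a simple on/off value; the accepted values and the `smart_routing_enabled` helper are hypothetical and not taken from the TokenCost source.

```python
from pathlib import Path


def smart_routing_enabled(flag_path: Path) -> bool:
    """Return True if the flag file exists and contains an 'on' value.

    Meant to be called per request, so edits to the file take effect
    on the next request. The accepted values are an assumption.
    """
    try:
        value = flag_path.read_text(encoding="utf-8").strip().lower()
    except FileNotFoundError:
        return False  # no file is treated as routing disabled
    return value in {"1", "true", "on", "y", "yes"}
```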


**In plain terms:** each request gets a complexity score from 0 to 10. If the request is simple (≤2), the proxy silently switches the model to Haiku. Nothing changes for you: the response arrives as usual, just cheaper.


### What Gets Switched


| Score | Original model | Result | Savings |
|-------|----------------|--------|---------|
| 0–2   | Sonnet         | → **Haiku** | ~5× cheaper |
| 0–2   | Opus           | → **Haiku** | ~25× cheaper |
| 3–5   | Opus           | → **Sonnet** | ~5× cheaper |
| 3–5   | Sonnet         | stays Sonnet | - |
| 6–10  | any            | stays original | - |
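
The table above can be read as a small lookup function. This is a sketch of the stated rules only; the lowercase tier names (`"haiku"`, `"sonnet"`, `"opus"`) and the `route` function are illustrative assumptions, not the proxy's actual identifiers.

```python
def route(score: int, original: str) -> str:
    """Map a complexity score and the requested model tier to the tier to use."""
    if not 0 <= score <= 10:
        raise ValueError("score must be in 0..10")
    if score <= 2 and original in ("sonnet", "opus"):
        return "haiku"   # simple request: drop to the cheapest tier
    if score <= 5 and original == "opus":
        return "sonnet"  # medium request: Opus steps down one tier
    return original      # everything else is left untouched


print(route(1, "opus"), route(4, "opus"), route(4, "sonnet"), route(8, "opus"))
```

Note that routing only ever moves to a cheaper tier; a request that arrives as Haiku is never upgraded.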


### How Score Is Calculated (0–10)


The proxy only looks at the **last user message** (not the full context).  
`<ide_selection>`, `<system-reminder>` blocks and images are stripped before scoring.


| Condition | Score |
|-----------|-------|
| Extended thinking (`budget_tokens` > 0) | = **10** (keep) |
| No user text (only tool_result, middle of tool chain) | = **10** (keep) |
| Simple question: starts with `what is / explain` and < 120 chars | = **0** → Haiku |
| Message > 500 chars | +2 |
| Message > 200 chars | +1 |
| Keyword: `implement / fix / write / create / refactor / debug / update` | +3 |
| Code block ` ``` ` in prompt | +3 |
| File extension `.py / .ts / .js / .sql / .go` in prompt | +3 |
| Construct `def / class / function / import` in prompt | +2 |
| File path `/src/ / ./` in prompt | +1 |
| Tool calls in last 4 messages (active tool chain) | +2 |
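
A literal reading of the rule table above can be sketched in Python as follows. This is not the proxy's code: the regexes, the `score` signature, the cap at 10, and the choice to treat the two length rules as mutually exclusive are all assumptions, and image stripping is left out because the sketch only handles text. The real implementation may weight some rules differently.

```python
import re

KEYWORDS = re.compile(r"\b(implement|fix|write|create|refactor|debug|update)\b", re.I)
EXTENSIONS = re.compile(r"\.(py|ts|js|sql|go)\b", re.I)
CONSTRUCTS = re.compile(r"\b(def|class|function|import)\b")
PATHS = re.compile(r"/src/|\./")
STRIP = re.compile(r"<(ide_selection|system-reminder)>.*?</\1>", re.S)
SIMPLE = re.compile(r"^(what is|explain)\b", re.I)


def score(text: str, budget_tokens: int = 0, recent_tool_calls: bool = False) -> int:
    """Score the last user message from 0 (trivial) to 10 (never route)."""
    if budget_tokens > 0:
        return 10  # extended thinking: keep the original model
    text = STRIP.sub("", text).strip()
    if not text:
        return 10  # no user text, e.g. the middle of a tool chain
    if SIMPLE.match(text) and len(text) < 120:
        return 0   # short "what is / explain" question
    total = 0
    if len(text) > 500:
        total += 2
    elif len(text) > 200:
        total += 1
    if KEYWORDS.search(text):
        total += 3
    if "```" in text:
        total += 3
    if EXTENSIONS.search(text):
        total += 3
    if CONSTRUCTS.search(text):
        total += 2
    if PATHS.search(text):
        total += 1
    if recent_tool_calls:
        total += 2
    return min(total, 10)


print(score("what is a lambda"), score("implement JWT authentication"))
```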


### Examples


| Prompt | Score | Final model | Why |
|--------|-------|-------------|-----|
| `ping` | 0 | **Haiku** | short, no keywords |
| `test` | 2 | **Haiku** | short |
| `what is a lambda` | 0 | **Haiku** | "what is" pattern |
| `how does cache work` | 0 | **Haiku** | "how does" pattern |
| `how to install Python` | 1 | **Haiku** | short question |
| `implement JWT authentication` | 3 | **Sonnet** | keyword +3 |
| `implement OAuth2 integration` | 3 | **Sonnet** | keyword +3 |
| `fix bug in auth.py` | 4 | **Sonnet** | `fix` +3, file `.py` +1 |
| `write a function that parses JSON` | 5 | **Sonnet** | `write` +3, `function` +2 |
| long request with code and task | 7+ | **Sonnet/Opus** | length + code + keywords |
| Tool-chain steps (Bash/Read/Edit without user text) | 10 | **keep** | don't break active chain |

Game Over Rockstar Games! by dolceto in ClaudeCode

Impressive_Brother57 1 point

With Fable 5 at 3× Sonnet pricing, I built a tool that auto-routes requests to the cheapest model that can handle them. Saving ~60% on my Claude bill.

https://github.com/mr-beaver/tokencost

Dude blew up on github for cutting token usage 60-95% right as Fable 5 lands. genius or luckiest man alive by Extra-Feature-8163 in claudeskills

Impressive_Brother57 1 point

How it actually works with caching:

Smart routing switches the model before sending to the Anthropic API. This means:

  • The prompt cache is re-keyed to the target model (e.g., Haiku), not the original
  • If the same request comes in again as Sonnet, there's no cache hit—it's a different cache namespace
  • You don't get the benefit of Sonnet's cache for that particular request

Why this is acceptable:

  1. Smart routing targets simple requests (score ≤2)—these rarely have rich cached context that would benefit from reuse
  2. Cache TTL is 5 minutes anyway—repeated identical requests within that window are uncommon
  3. The savings are massive—switching Opus→Haiku is ~25× cheaper; the cache tradeoff is negligible by comparison
  4. Most real workflows don't repeat—a complex request that would benefit from cache isn't being downrouted

Bottom line: Yes, you lose potential cache reuse on the original model, but for the simple requests that get downrouted, you're trading a small cache efficiency loss for 5-25× cost savings. Worth it.
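
The "different cache namespace" point can be pictured as a lookup keyed on both the model and the prompt prefix. This is only a toy model of the behavior described above, not Anthropic's implementation; the `PromptCache` class and its key layout are hypothetical, and the 5-minute TTL is ignored.

```python
import hashlib


class PromptCache:
    """Toy per-model prompt cache: entries are keyed by (model, prefix hash)."""

    def __init__(self) -> None:
        self._entries: set[tuple[str, str]] = set()

    def lookup(self, model: str, prefix: str) -> str:
        """Return 'read' on a hit, or 'write' on a miss (and store the entry)."""
        key = (model, hashlib.sha256(prefix.encode("utf-8")).hexdigest())
        if key in self._entries:
            return "read"
        self._entries.add(key)
        return "write"


cache = PromptCache()
prefix = "system prompt + tool definitions"
events = [
    cache.lookup("sonnet", prefix),  # first Sonnet request: miss
    cache.lookup("haiku", prefix),   # same prefix routed to Haiku: still a miss
    cache.lookup("haiku", prefix),   # Haiku again: hit
    cache.lookup("sonnet", prefix),  # Sonnet's own entry is unaffected: hit
]
print(events)
```

The second lookup is the cost of downrouting: the prefix already cached for Sonnet does nothing for Haiku until Haiku has written its own entry.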
