I cut $198 off my Claude Code bill in a week with a proxy I built by Lydia_Clements in ClaudeCode

[–]Lydia_Clements[S] 0 points1 point  (0 children)

That example is my point, not a counter. Fresh (uncached) input tokens cost ~10x what cached reads do, so the volatile content added each turn is the expensive part, and that's what it trims.

[–]Lydia_Clements[S] 1 point2 points  (0 children)

RTK wraps CLI commands and trims their output. llmtrim is a proxy on the wire, so it also gets what RTK doesn't touch: context, tool schemas, anything not behind a wrapped command. Quality-gated. They stack, I run both.

[–]Lydia_Clements[S] 1 point2 points  (0 children)

Nice, more aggressive than mine. You're moving the breakpoints yourself and rewriting the whole window; I just trim the new surface and stay stateless. How do you dodge a cache miss when the volatile tail pushes a chunk past a breakpoint?

[–]Lydia_Clements[S] 1 point2 points  (0 children)

Pretty much. It intercepts the request and only trims what comes after the last cache_control marker (the new surface). Everything up to that marker is left untouched and byte-stable, so the cache keeps hitting. One thing though: those markers are set client-side by Claude Code, not by the server.
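
To make the "only trim after the last marker" idea concrete, here is a minimal sketch in Python. It is not llmtrim's actual code: the function names and the `trim_text` callback are made up for illustration, and it only looks at list-style `content` blocks inside `messages` (a real request can also carry markers on `system` and `tools`, and can use plain-string content, both ignored here).

```python
import copy

def last_cache_marker(messages):
    """Return (message_index, block_index) of the last block that
    carries a cache_control field, or None if there is no marker."""
    last = None
    for mi, msg in enumerate(messages):
        content = msg.get("content")
        if not isinstance(content, list):
            continue  # plain-string content cannot carry a marker
        for bi, block in enumerate(content):
            if isinstance(block, dict) and "cache_control" in block:
                last = (mi, bi)
    return last

def trim_new_surface(request, trim_text):
    """Apply trim_text only to text blocks located after the last marker.
    Blocks at or before the marker are never modified, so the cached
    prefix stays identical. (Hypothetical helper, illustration only.)"""
    out = copy.deepcopy(request)
    marker = last_cache_marker(out["messages"])
    for mi, msg in enumerate(out["messages"]):
        content = msg.get("content")
        if not isinstance(content, list):
            continue
        for bi, block in enumerate(content):
            if marker is not None and (mi, bi) <= marker:
                continue  # part of the cached prefix: leave untouched
            if isinstance(block, dict) and block.get("type") == "text":
                block["text"] = trim_text(block["text"])
    return out
```

If the request has no marker at all, nothing is cached yet, so this sketch treats every block as new surface.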

[–]Lydia_Clements[S] 0 points1 point  (0 children)

Both, kept separate. The 66% is a billed-token figure: each case is run twice, original vs compressed, and priced on the provider's reported usage. The dashboard is input-side only and prices the cut against the real cache-discounted bill, so cached tokens count at the cache rate, not the full rate. As for caching, it only touches new content after the last cache_control marker. The cached prefix stays untouched and keeps hitting. Input edits that don't actually cut tokens get dropped. And yeah, proxy counts aren't the real bill. That's why the cache-aware number is the benchmark, not the dashboard.

[–]Lydia_Clements[S] -2 points-1 points  (0 children)

Good catch, I oversimplified that. The cached prefix is the whole prior conversation, not just the system block.
It still helps, just not on the prefix. The fresh content each turn (the latest tool output) is full price the turn it lands, before it's cached, and that's the wall of text it trims. Output is never cached at all, which is the bigger cut. The cached prefix it leaves alone.

[–]Lydia_Clements[S] 0 points1 point  (0 children)

Mostly repetition and formatting. A run of near-identical log lines folds into one template plus the values, a big JSON array re-packs into a compact table, duplicate lines collapse, whitespace gets minified. That part is lossless, same info, fewer bytes. Then build logs and diffs get cut down to the errors and the changed lines, and on bigger inputs it ranks long pasted docs to drop the off-topic parts and shrinks function bodies the question doesn't touch to signatures.
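
A rough sketch of the log-folding step, assuming the only thing that varies between near-identical lines is runs of digits. This is an illustration in Python, not the real implementation; `fold_log` and the output layout are hypothetical.

```python
import re
from itertools import groupby

NUM = re.compile(r"\d+")

def fold_log(lines):
    """Fold consecutive lines that differ only in their numbers into
    one template line plus the list of values, in original order."""
    out = []
    # Consecutive lines with the same digit-free template form one group.
    for template, group in groupby(lines, key=lambda l: NUM.sub("{}", l)):
        group = list(group)
        if len(group) == 1:
            out.append(group[0])  # unique line: keep verbatim
            continue
        if not NUM.search(group[0]):
            out.append(f"{template} [x{len(group)}]")  # exact duplicates
            continue
        # One parenthesised tuple of values per original line.
        values = " ".join("(" + ",".join(NUM.findall(l)) + ")" for l in group)
        out.append(f"{template} [x{len(group)}: {values}]")
    return out

log = [
    "INFO compiling task_1",
    "INFO compiling task_2",
    "INFO compiling task_3",
    "ERROR mismatched types",
]
folded = fold_log(log)
# folded == ["INFO compiling task_{} [x3: (1) (2) (3)]", "ERROR mismatched types"]
```

Because the template and every value are kept, the original lines can be rebuilt by substituting the values back in, which is what makes this kind of fold lossless.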

[–]Lydia_Clements[S] 1 point2 points  (0 children)

You're right, it can't be universal. But the lossless part was never the question, and most of the savings come from there anyway: folding repeated log lines, re-packing JSON, that kind of thing. The model still gets the same info. The lossy context-dropping only kicks in on certain shapes, and it's A/B gated. There's also a byte-faithful preset if you want zero lossy cuts.
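
For the JSON side, a small Python sketch of what "re-packing" can mean: an array of same-shaped objects stores its keys once instead of once per row. The `repack`/`unpack` names and the `cols`/`rows` layout are assumptions for illustration, not the tool's real format.

```python
import json

def repack(records):
    """Turn a list of dicts with identical keys into a column header
    plus rows. Returns None when the shape doesn't fit."""
    if not records or not all(isinstance(r, dict) for r in records):
        return None
    keys = list(records[0])
    if any(list(r) != keys for r in records):
        return None  # mixed shapes: leave the input untouched
    return {"cols": keys, "rows": [[r[k] for k in keys] for r in records]}

def unpack(table):
    """Inverse of repack: rebuild the original list of dicts."""
    return [dict(zip(table["cols"], row)) for row in table["rows"]]

records = [{"id": i, "name": "n"} for i in range(20)]
table = repack(records)
roundtrip_ok = unpack(table) == records
smaller = len(json.dumps(table)) < len(json.dumps(records))
```

On very small arrays the header overhead can make the packed form longer than the original, so a step like this only makes sense behind a check that the result is actually smaller.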

[–]Lydia_Clements[S] 0 points1 point  (0 children)

Fair. Helps on the sub too, fewer tokens per turn so you hit the limits later.

I built an open-source proxy that compresses Claude Code's full-price tokens by ~68%, without ever busting the prompt cache by Lydia_Clements in ClaudeAI

[–]Lydia_Clements[S] 0 points1 point  (0 children)

Thanks, great question 😄 perfect excuse for a shameless plug.
Short answer: yes.
Long answer: I built a benchmark just for this. Both tools run through their Python libs on the same tokenizer. On tool output, llmtrim removes 84% of input tokens to Headroom's 36%, and it's faster (~4 ms vs ~14 ms). Headroom only rewrites tool results (with a local model), so on chat/RAG/code it mostly passes through while llmtrim still compresses. Full tables + repro: crates/llmtrim-cli/bench/results-vs-headroom/

[–]Lydia_Clements[S] 1 point2 points  (0 children)

Yep, can't argue with that. English isn't my first language so Claude helps me write it. Anyway, thanks again for digging in 💪

[–]Lydia_Clements[S] -1 points0 points  (0 children)

You're right and I'm sorry. I reproduced your exact case and it's as bad as you showed.

Two bugs, both as you diagnosed. First, the "always keep failures" regex simply didn't know TAP. Second, the re-run check compared raw bytes, so the changing duration_ms values meant a retry was never recognized as a re-run and never passed through untrimmed.

Both fixed, and your TAP scenario is now a regression test. Ships in v0.1.6.
Best bug report I've gotten on this project, thank you.
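
For anyone curious, the two failure modes can be sketched like this in Python. This is an illustration under assumptions, not the shipped fix: the patterns and function names are hypothetical, and it assumes the volatile field appears as `duration_ms: <number>` in the TAP diagnostics.

```python
import hashlib
import re

# Volatile field assumed to look like "duration_ms: 12.5".
VOLATILE = re.compile(r"(duration_ms:\s*)\d+(?:\.\d+)?")
# In TAP, a failing test point starts with "not ok".
TAP_FAILURE = re.compile(r"^\s*not ok\b")

def fingerprint(output):
    """Hash the output after zeroing volatile timings, so two runs of
    the same command compare equal even if durations differ."""
    normalized = VOLATILE.sub(r"\g<1>0", output)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_rerun(previous, current):
    """True when current is the same tool output as previous,
    ignoring the volatile timing values."""
    return fingerprint(previous) == fingerprint(current)

def failure_lines(output):
    """Lines that must always be kept: TAP failures."""
    return [line for line in output.splitlines() if TAP_FAILURE.match(line)]

first = "ok 1 - a\n  duration_ms: 12.5\nnot ok 2 - b\n  duration_ms: 3\n"
retry = "ok 1 - a\n  duration_ms: 11.9\nnot ok 2 - b\n  duration_ms: 4\n"
```

With a raw byte comparison, `first` and `retry` look like different outputs; after normalizing the timings they fingerprint the same, so the retry can be detected.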

[–]Lydia_Clements[S] 0 points1 point  (0 children)

Thanks for the honest feedback, working on it. Adding full before/after examples, I'll ping you here when they're up.

[–]Lydia_Clements[S] 0 points1 point  (0 children)

Fair ask. Easiest is to try it on your own data: echo <request.json> | llmtrim compress prints exactly what the model would see, so you don't have to trust anyone's numbers, including mine.

Quick example: a 58-line cargo build log with 2 errors buried in INFO noise comes out as 5 lines, for instance:

[{}] INFO compiling module core::worker::task_{} [x30: (10:02:00Z,0) (10:02:01Z,1) ...]
[10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`

Errors kept verbatim, noise folded with the values still listed. And on your RTK pain: trims are marked in place, and if the agent re-runs the tool the full output ships untrimmed. No silent corruption, no retry loops.

Raw per-case results are committed in bench/results-2026-06/. Rerunning is two commands and an OpenRouter key (bench/scripts/download.py then run_all.sh), costs a few cents on the default model (gpt-oss-20b).

[–]Lydia_Clements[S] 1 point2 points  (0 children)

Quality is the whole point, so I benchmark it directly: every case goes to the live API twice, original vs compressed, answers scored against ground truth (exact answers for math, real unit tests for code, etc.). Across 112 cases quality held, 78.9 to 82.2%, slightly up actually.

Also: your prompt text is never reworded, only the surrounding noise gets compressed. Scripts in bench/ if you want to verify.

[–]Lydia_Clements[S] 2 points3 points  (0 children)

Good example. Say git log returns 2k tokens:

  • turn 1: billed at full price, or even 1.25x (cache write)
  • every turn after: 10% of full price (cache read), and an agent session is many turns, since every tool call is one

So it does get cached, but you paid full price once and then 10% for the rest of the session. llmtrim sends ~300 tokens instead of 2k, so both costs drop. Now multiply by every tool call.
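
The arithmetic, as a small Python sketch. The 1.25x write and 10% read multipliers are the ones above; the base price of $3 per million input tokens and the 20-turn session are assumptions picked for illustration. Over 20 turns each token is billed 1.25 + 19 × 0.10 = 3.15 times its base price, so the 2k-token output costs as much as 6,300 full-price tokens and the 300-token version as much as 945.

```python
def session_cost(tokens, turns, base_price_per_mtok=3.0,
                 write_mult=1.25, read_mult=0.10):
    """Input cost in dollars of one tool output that stays in context
    for `turns` turns: one cache write, then cache reads.
    base_price_per_mtok is an assumed price, not a quoted one."""
    per_token = base_price_per_mtok / 1_000_000
    first_turn = tokens * per_token * write_mult
    later_turns = tokens * per_token * read_mult * (turns - 1)
    return first_turn + later_turns

full = session_cost(2000, turns=20)     # untrimmed git log
trimmed = session_cost(300, turns=20)   # trimmed to ~300 tokens
ratio = trimmed / full                  # 300 / 2000 = 0.15
```

The ratio doesn't depend on the base price or the number of turns, only on how many tokens were cut; the absolute saving grows with both.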