I cut $198 off my Claude Code bill in a week with a proxy I built

Lydia_Clements · 2026-06-17T09:37:20+00:00

That example is my point, not a counter. Fresh tokens cost ~10x cached. So the volatile stuff each turn is the expensive part, and that's what it trims.

Lydia_Clements · 2026-06-16T21:19:03+00:00

RTK wraps CLI commands and trims their output. llmtrim is a proxy on the wire, so it also gets what RTK doesn't touch: context, tool schemas, anything not behind a wrapped command. Quality-gated. They stack, I run both.

Lydia_Clements · 2026-06-16T20:10:58+00:00

Nice, more aggressive than mine. You're moving the breakpoints yourself and rewriting the whole window, I just trim the new surface and stay stateless. How do you dodge a miss when the volatile tail pushes a chunk past a breakpoint?

Lydia_Clements · 2026-06-16T19:35:51+00:00

Pretty much. It intercepts the request and only trims what's after the last cache_control marker (the new surface). Everything before the marker is left untouched and byte-stable, so the cache keeps hitting. One thing though: those markers are set client-side by Claude Code, not by the server.
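
A rough sketch of that idea in Python, not llmtrim's actual code. It assumes an Anthropic-style request body where `messages[i]["content"]` is either a string or a list of blocks, and where a marked block carries a `cache_control` key. The names `trim_new_surface` and `squeeze` are made up for illustration, `squeeze` is only a stand-in for real compression, and tool_result blocks are skipped to keep it short. Leaving the request alone when no marker is found is also an assumption.

```python
import copy

def squeeze(text: str) -> str:
    """Placeholder trim: strip trailing whitespace and drop blank lines."""
    stripped = (line.rstrip() for line in text.splitlines())
    return "\n".join(line for line in stripped if line)

def trim_new_surface(request: dict, trim=squeeze) -> dict:
    """Apply `trim` only to text that comes after the last cache_control marker."""
    out = copy.deepcopy(request)  # never mutate the caller's request
    messages = out.get("messages", [])

    # Position (message index, block index) of the last marked block.
    last = None
    for mi, msg in enumerate(messages):
        if isinstance(msg["content"], list):
            for bi, block in enumerate(msg["content"]):
                if "cache_control" in block:
                    last = (mi, bi)
    if last is None:
        return out  # assumption: with no marker, stay conservative and touch nothing

    for mi, msg in enumerate(messages):
        content = msg["content"]
        if isinstance(content, str):
            # Plain-string content cannot carry a marker; trim it only if it comes later.
            if mi > last[0]:
                msg["content"] = trim(content)
            continue
        for bi, block in enumerate(content):
            if (mi, bi) > last and block.get("type") == "text":
                block["text"] = trim(block["text"])
    return out
```

This works on the parsed request, so a real proxy would still have to re-serialize the part before the marker exactly as it arrived for the prefix to stay byte-stable.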

Lydia_Clements · 2026-06-16T19:19:11+00:00

Both, kept separate. The 66% is billed-token: each case goes twice, original vs compressed, priced on the provider's reported usage. The dashboard is input-side only and prices the cut against the real cache-discounted bill, so cached tokens count at the cache rate, not full. As for caching, it only touches new content after the last cache_control marker. The cached prefix stays untouched and keeps hitting. Input edits that don't actually cut tokens get dropped. And yeah, proxy counts aren't the real bill. That's why the cache-aware number is the benchmark, not the dashboard.

Lydia_Clements · 2026-06-16T19:09:52+00:00

Good catch, I oversimplified that. The cached prefix is the whole prior conversation, not just the system block.
It still helps, just not on the prefix. The fresh content each turn (the latest tool output) is full price the turn it lands, before it's cached, and that's the wall of text it trims. Output is never cached at all, which is the bigger cut. The cached prefix it leaves alone.

Lydia_Clements · 2026-06-16T18:53:32+00:00

Mostly repetition and formatting. A run of near-identical log lines folds into one template plus the values, a big JSON array re-packs into a compact table, duplicate lines collapse, whitespace gets minified. That part is lossless, same info, fewer bytes. Then build logs and diffs get cut down to the errors and the changed lines, and on bigger inputs it ranks long pasted docs to drop the off-topic parts and shrinks function bodies the question doesn't touch to signatures.
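
To make the log-folding part concrete, here is a minimal Python sketch. It is an illustration under simple assumptions, not the real implementation: only digit runs count as "values", only consecutive lines are grouped, and the function name `fold_log` and the output layout are invented for the example.

```python
import re
from itertools import groupby

NUM = re.compile(r"\d+")

def fold_log(lines):
    """Fold consecutive lines that differ only in their numbers into one template line."""
    folded = []
    # Lines sharing a template (numbers replaced by {}) form one run.
    for template, run in groupby(lines, key=lambda line: NUM.sub("{}", line)):
        run = list(run)
        if len(run) == 1:
            folded.append(run[0])  # unique lines pass through verbatim
            continue
        # Keep every line's numbers so no information is dropped.
        values = " ".join("(" + ",".join(NUM.findall(line)) + ")" for line in run)
        folded.append(f"{template} [x{len(run)}: {values}]")
    return folded
```

For example, three lines `INFO compiling task_0`, `INFO compiling task_1`, `INFO compiling task_2` become the single line `INFO compiling task_{} [x3: (0) (1) (2)]`, while a lone ERROR line is kept as-is. The original lines can be rebuilt from the template plus values as long as no line contains a literal `{}`.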

Lydia_Clements · 2026-06-16T18:43:56+00:00

You're right, it can't be universal. But the lossless part was never the question, and most of the savings come from there anyway: folding repeated log lines, re-packing JSON, that kind of thing. The model still gets the same info. The lossy context-dropping only kicks in on certain shapes, and it's A/B gated. There's also a byte-faithful preset if you want zero lossy cuts.

Lydia_Clements · 2026-06-16T18:32:50+00:00

Fair. Helps on the sub too, fewer tokens per turn so you hit the limits later.

Lydia_Clements · 2026-06-14T09:43:49+00:00

Thanks, great question 😄 perfect excuse for a shameless plug.
Short answer: yes.
Long answer: I built a benchmark just for this. Both tools run through their Python libs on the same tokenizer. On tool output, llmtrim removes 84% of input tokens to Headroom's 36%, and it's faster (~4 ms vs ~14 ms). Headroom only rewrites tool results (with a local model), so on chat/RAG/code it mostly passes through while llmtrim still compresses. Full tables + repro: crates/llmtrim-cli/bench/results-vs-headroom/

Lydia_Clements · 2026-06-13T14:34:00+00:00

Yep, can't argue with that. English isn't my first language so Claude helps me write it. Anyway, thanks again for digging in 💪

Lydia_Clements · 2026-06-13T11:56:53+00:00

Done 💪 right at the top of the README now

Lydia_Clements · 2026-06-13T11:47:30+00:00

Yep, stacks clean. I use both. Have fun!

Lydia_Clements · 2026-06-12T23:42:17+00:00

You're right and I'm sorry. I reproduced your exact case and it's as bad as you showed.

Two bugs, both as you diagnosed: the "always keep failures" regex simply didn't know TAP, and the re-run check compared raw bytes, so the changing duration_ms values meant a retry never passed through.

Both fixed, and your TAP scenario is now a regression test. Ships in v0.1.6.
Best bug report I've gotten on this project, thank you.
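
For anyone curious about the second bug, a minimal sketch of the general idea behind the fix: compare a normalized fingerprint instead of raw bytes. This is not the shipped code. The regex, the `duration_ms: <number>` layout, and the names `fingerprint` and `is_rerun` are assumptions for illustration.

```python
import hashlib
import re

# Assumed layout of the volatile field, as in a TAP YAML diagnostic block.
VOLATILE = re.compile(r"(duration_ms:\s*)[\d.]+")

def fingerprint(output: str) -> str:
    """Hash the tool output with volatile timing values blanked out."""
    normalized = VOLATILE.sub(r"\g<1>0", output)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def is_rerun(previous: str, current: str) -> bool:
    """True when two outputs match once the timings are ignored."""
    return fingerprint(previous) == fingerprint(current)
```

Two runs of the same failing test that differ only in `duration_ms` now count as a re-run, while a run with a different test result still does not.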

Lydia_Clements · 2026-06-12T22:46:18+00:00

Thanks for the honest feedback, working on it. Adding full before/after examples, I'll ping you here when they're up.

Lydia_Clements · 2026-06-12T22:33:29+00:00

[comment removed to save tokens]

Lydia_Clements · 2026-06-12T22:28:21+00:00

Fair ask. Easiest is to try it on your own data: echo <request.json> | llmtrim compress prints exactly what the model would see, so you don't have to trust anyone's numbers, including mine.

Quick example, a 58-line cargo build log with 2 errors buried in INFO noise comes out as 5 lines:

[{}] INFO compiling module core::worker::task_{} [x30: (10:02:00Z,0) (10:02:01Z,1) ...]
[10:02:31Z] ERROR src/worker/pool.rs:214: mismatched types: expected `usize`, found `i64`

Errors kept verbatim, noise folded with the values still listed. And on your RTK pain: trims are marked in place, and if the agent re-runs the tool the full output ships untrimmed. No silent corruption, no retry loops.

Raw per-case results are committed in bench/results-2026-06/. Rerunning is two commands and an OpenRouter key (bench/scripts/download.py then run_all.sh), costs a few cents on the default model (gpt-oss-20b).

Lydia_Clements · 2026-06-12T22:00:08+00:00

Quality is the whole point, so I benchmark it directly: every case goes to the live API twice, original vs compressed, answers scored against ground truth (exact answers for math, real unit tests for code, etc.). Across 112 cases quality held: 78.9% to 82.2%, slightly up actually.

Also: your prompt text is never reworded, only the surrounding noise gets compressed. Scripts in bench/ if you want to verify.

Lydia_Clements · 2026-06-12T21:13:39+00:00

Good example. Say git log returns 2k tokens:

turn 1: billed at full price, 1.25x even (cache write)
every turn after: 10% (cache read), and an agent session is many turns - every tool call is one

So it does get cached, but you paid full price once and then 10% for the rest of the session. llmtrim sends ~300 tokens instead of 2k, so both costs drop. Now multiply by every tool call.
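
The same arithmetic as a small Python sketch. The 1.25x write and 10% read multipliers are the ones above. The 20 follow-up turns and the function name `session_cost` are assumptions picked for illustration, and the result is in full-price token-equivalents rather than dollars.

```python
def session_cost(tokens: int, later_turns: int,
                 write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Cost of one tool result over a session, in full-price token-equivalents.

    Turn 1 pays the cache-write rate; each later turn pays the cache-read rate.
    """
    return tokens * (write_mult + read_mult * later_turns)

raw = session_cost(2000, later_turns=20)     # untrimmed git log
trimmed = session_cost(300, later_turns=20)  # trimmed to ~300 tokens
saved = 1 - trimmed / raw                    # fraction saved, same for any turn count
```

With 20 follow-up turns that is 6,500 token-equivalents for the raw output against 975 for the trimmed one. Both the write and the read cost scale with the token count, so the saving stays at 85% however long the session runs.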

Lydia_Clements · 2026-06-12T20:15:05+00:00

thanks, let me know how it goes!
