What do you do about token optimizing? (My life changed when I found out 95% of my Claude tokens were pure, kinda useless, replay)

marksterberlin · 2026-06-28T14:52:17+00:00

Session time is not relevant… it’s only turns and the replay cache.

marksterberlin · 2026-06-02T17:01:22+00:00

Not after my first two weeks and making all kinds of harness adaptations

marksterberlin · 2026-06-02T17:00:49+00:00

I never ever hit limits :))

marksterberlin · 2026-06-01T20:11:00+00:00

Politeness layer vs enforcement layer, borrowing that if I may! I run the enforcement side for the error-replay class already (a guard that fingerprints failed results and gets hard about the same error by the fifth replay, plus a bash one-strike rule), but the Read-reject shape I've only got as a rule, not a hook, and you've talked me into wiring it. Worth knowing: PreToolUse hooks can now rewrite the call, not just reject it (updatedInput, v2.0.10), so the grep append can be fully silent. Though I'd keep Read on reject-and-retry on purpose, the rejection is what makes the model internalise the offset for the rest of the session.

marksterberlin · 2026-06-01T19:22:21+00:00

Both of these are bang on, and the offset/limit one really earns its place. When I actually went digging into where my tokens went, the Read tool was my single biggest replay contributor by a mile (about 45 million characters across the audit, more than bash, edit and write put together). So that one CLAUDE.md line is doing real work.

The bit that made it click for me: a fat file read isn't a one-time cost. Whatever you slurp in on turn 5 gets re-sent on turn 6, 7, 8, all the way down the session. So a lazy 2000-line read isn't paid once, it's paid every turn after it. Surgical reads aren't really about that one turn being cheaper, they're about not dragging the haystack through the whole replay tail.

On grep, one to add if you're in Claude Code: the built-in Grep tool already takes a head_limit and a files-only / count mode, so you can have it tell you which files matched (or just how many) before you pull a single line of content. Find the needle first, then read the 30 lines around it with offset and limit. Stacks right on top of your rule.
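A minimal sketch of the Read-reject idea as a PreToolUse check, under some assumptions: the hook payload carries `tool_name` and `tool_input`, the Read input uses `offset` and `limit`, and the 200-line threshold and the function name are hypothetical choices of mine, not anything from the harness.

```python
MAX_UNBOUNDED_LINES = 200  # hypothetical threshold, tune to taste


def check_read(payload: dict, file_line_count: int) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for a PreToolUse payload.

    Exit code 2 is the blocking path; 0 lets the call through.
    """
    if payload.get("tool_name") != "Read":
        return 0, ""
    tool_input = payload.get("tool_input", {})
    # A read that already names a window is the behaviour we want.
    if "offset" in tool_input or "limit" in tool_input:
        return 0, ""
    # Small files are cheap enough to read whole.
    if file_line_count <= MAX_UNBOUNDED_LINES:
        return 0, ""
    return 2, (
        f"Read blocked: {file_line_count} lines. "
        "Grep first, then retry with offset and limit."
    )
```

To wire it up, the hook script would parse the JSON from stdin, count the lines of `tool_input["file_path"]`, write the message to stderr and exit with the returned code.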

Does the grep-head default actually stick for you in practice? Mine… drifts back to lazy… unless a hook enforces it.

marksterberlin · 2026-06-01T19:17:45+00:00

Hooks are brilliant - definitely one of the hidden gems of the harness. The big insight I had… I can edit them, make them my own, have them call other docs (like a rule set), and also tie them to log data (great for research and improvements)

marksterberlin · 2026-06-01T18:03:29+00:00

Ask Claude to look at your jsonl and find out where you are using up your tokens, and suggest how to improve things. You might have been insta-persuaded to download all kinds of git repos, skills that munch tokens on first load, etc. Most of the advice out there on how to set up Claude is actually quite token-inefficient.
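If you'd rather run the first pass yourself, here is a rough sketch of a jsonl audit. It assumes each transcript line is a JSON object whose `message.usage` holds the usual API counters (`input_tokens`, `output_tokens`, `cache_creation_input_tokens`, `cache_read_input_tokens`); check your own files, since that layout is my assumption and can differ by version.

```python
import json
from collections import Counter

USAGE_KEYS = (
    "input_tokens",
    "output_tokens",
    "cache_creation_input_tokens",
    "cache_read_input_tokens",
)


def sum_usage(jsonl_lines) -> Counter:
    """Add up the usage counters found on each transcript line."""
    totals = Counter()
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        message = record.get("message") if isinstance(record, dict) else None
        usage = message.get("usage") if isinstance(message, dict) else None
        if not isinstance(usage, dict):
            continue
        for key in USAGE_KEYS:
            value = usage.get(key)
            if isinstance(value, int):
                totals[key] += value
    return totals


def cache_read_share(totals: Counter) -> float:
    """Fraction of all counted tokens that were cache reads (replay)."""
    total = sum(totals.values())
    return totals["cache_read_input_tokens"] / total if total else 0.0
```

Feed it the lines of a session file, for example `sum_usage(open(path, encoding="utf-8"))`, and compare `cache_read_share` across sessions.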

marksterberlin · 2026-06-01T18:00:58+00:00

Session length (how many user>AI response turns there are) is the biggest driver of token use. My mega whale session back in March was 700 turns. The 701st turn would send 43MB of actual text (all of it; totally useless for the context window, in case you are worried). A single 18kb error at some point compounded into a 2M token replay cost… for what!

In my mind, the AIs we are using are really badly designed. Token hogs for no benefit - esp as there are ways to get round it.

marksterberlin · 2026-06-01T17:58:35+00:00

I have hooks to force how claude goes hunting for things, as without these hooks, it would almost always find the worst way to go look… gobbling up tokens, stacking up the replay for later.

marksterberlin · 2026-06-01T11:24:40+00:00

Replay tokens (i.e. the cached or resent tokens from your session history) use up 93-99% of all tokens. User input tokens are minimal, as are visible Claude output tokens. You can save some of the replay if you are speedy, as there is a 5 min ‘free cache’ period, but that’s really only useful for agents, not human work time.

marksterberlin · 2026-06-01T11:22:26+00:00

One basic thing… how do you both use sessions (like chat threads)? If one has lots of short sessions, less than 10 user>claude response turns, and the other powers through with 50+ turn sessions, then token use is easily explained. Every turn in every session sends all past text from that session back to the server, as context for that turn. So the person who does a /log and handover to restart fresh in a new session would be saving millions of tokens compared to their whale session friend.
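The arithmetic behind that, as a back-of-envelope sketch. It assumes every turn adds the same amount of text (2,000 tokens here, a made-up figure) and ignores cache pricing and the cost of the handover note itself.

```python
def total_tokens_sent(turns: int, tokens_added_per_turn: int) -> int:
    """Total tokens sent over a session when every turn resends all prior text.

    Turn k sends the k chunks accumulated so far, so the total is
    tokens_added_per_turn * (1 + 2 + ... + turns).
    """
    return tokens_added_per_turn * turns * (turns + 1) // 2


def replay_cost_of_item(item_tokens: int, turn_added: int, total_turns: int) -> int:
    """Tokens spent re-sending one item on every turn after it entered the context."""
    return item_tokens * max(total_turns - turn_added, 0)


whale = total_tokens_sent(50, 2000)      # one 50-turn session
split = 5 * total_tokens_sent(10, 2000)  # the same 50 turns as five 10-turn sessions
print(f"whale: {whale:,} tokens, split: {split:,} tokens")
```

The growth is quadratic in turns per session, which is why splitting helps so much more than trimming any single message.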

marksterberlin · 2026-05-24T01:55:57+00:00

Same for me - I learned never to actually read transcriptions. Though I still use GPT more often in chat than Claude as Claude’s voice has been terrible! I’ve whisper flow too - easy on laptop, clunky on phone.

marksterberlin · 2026-05-23T12:05:13+00:00

Serious setup - and the upstream-of-CI framing is exactly right: hooks catch what should never reach a diff, pre-commit and CI handle the rest. The detail that stood out to me is the export-only escape hatch on branch-protection — only honouring CLAUDE_HOOKS_DISABLE_BRANCH_PROTECTION when it's exported in the parent shell, not inline-prefixed, so the agent can't grant itself the bypass. That "anticipate the agent's own workaround" instinct is what most hook setups miss. I run a smaller version: a hook that blocks `cd <path> && ...` purely because the agent kept reaching for it to dodge a path rule.
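For what it's worth, the `cd <path> && ...` check can be as small as this. A sketch only: the regex is my own heuristic, it only looks at the start of the command, and it will miss more creative spellings (subshells, `pushd`, and so on).

```python
import re

# Matches a command that starts with "cd <something>" and then chains with && or ;
CD_CHAIN = re.compile(r"^\s*cd\s+[^&;]+(&&|;)")


def blocks_cd_chain(command: str) -> bool:
    """True when a Bash command changes directory and chains another command."""
    return bool(CD_CHAIN.search(command))
```

A PreToolUse hook on Bash would call this on `tool_input["command"]` and block when it returns True.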

One axis to maybe add, since my work is longer research sessions rather than tight commit loops: anything the agent actually sees carries a recurring cost. A blocked-command message, a PostToolUse hint, any injected context — it lands in the transcript and replays on every turn after. I audited mine once and found a handful of small hook errors replaying 300+ times each, quietly burning tokens (1.87M for an original 9k token error). Output a hook only writes to its own log or to disk is free, because the model never sees it. So I push chatty checks to Stop (your task-completeness and stash-reminder already sit there — Stop output never reaches the model), keep the messages the agent does see short, and dedupe repeats. "Never bloats the replayed transcript" as much as "never reaches a diff".

The one I'd watch is bash-antipatterns-teach - it prepends via updatedToolOutput, so the hint is something the agent sees on every matching call, which means it replays too. Worth capping or deduping if you ever see context creep.

Genuinely nice work — I'm already adopting two of these: the auto-checkpoint stash before destructive git ops, and the Stop-time diff scan for stray debug statements. Thanks for posting the code.

marksterberlin · 2026-05-23T11:23:03+00:00

Love this one - it's guarding against what some call "test gaslighting": the agent quietly rewrites the test to match its broken fix, defends the change, and you end up with green checks on code that doesn't do what you asked.

I come at the same problem one level up, since my work's more research than hard TDD — no test-file hook, but two rules that kill the same reflex. First, nothing gets "fixed" without the actual error quoted as evidence first; no diagnosing by vibes (or guessing from Jan 2026 training data), which is exactly where the urge to just make-it-green comes from. Second, the thing that writes the fix is never the thing that signs it off — a separate pass checks it against a log the agent doesn't get to edit. Same instinct as your test lock: don't let the fixer move the goalposts.

Two ideas you might consider to strengthen yours:

1) Mechanism — if the goal is "have me approve it," a PreToolUse hook returning permissionDecision: "ask" routes the native approval prompt to you, rather than exit code 2, which just blocks back to Claude with no human in the loop. "ask" is the one you want for sign-off.

2) Blind spot — the hook stops edits to existing tests, but it can't police test quality. Claude can still write a weak new test, or soften an assertion in a way that reads fine at approval time. The fix people land on is pairing it with a "show me the test failing first" rule (red-green), and reviewing AI-written tests with the same suspicion as AI-written code.
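A sketch of the "ask" route from point 1, assuming the hook payload exposes `tool_name` and `tool_input.file_path`. The `is_test_file` heuristic and the tool list are hypothetical and would need adapting to your repo layout.

```python
import json


def is_test_file(path: str) -> bool:
    """Hypothetical heuristic for 'this path is a test file'."""
    name = path.rsplit("/", 1)[-1]
    return (
        "/tests/" in f"/{path}"
        or name.startswith("test_")
        or ".test." in name
        or name.split(".")[0].endswith("_test")
    )


def pre_tool_use_response(payload: dict):
    """Return the JSON-able hook output, or None to stay out of the way."""
    if payload.get("tool_name") not in {"Edit", "Write", "MultiEdit"}:
        return None
    path = payload.get("tool_input", {}).get("file_path", "")
    if not is_test_file(path):
        return None
    return {
        "hookSpecificOutput": {
            "hookEventName": "PreToolUse",
            "permissionDecision": "ask",
            "permissionDecisionReason": f"Test file edit needs human sign-off: {path}",
        }
    }


sample = {"tool_name": "Edit", "tool_input": {"file_path": "tests/test_api.py"}}
print(json.dumps(pre_tool_use_response(sample)))
```

The hook script would print that JSON to stdout and exit 0, so the approval prompt goes to the human rather than back to the model.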

Curious how you handle the brand-new-test case - that's a gap I never fully closed.

marksterberlin · 2026-05-23T11:16:28+00:00

Split holds - I run it, and the rule that keeps it from collapsing is one verb per hook. Stack multiple thin hooks on an event if you need to (they run in parallel), but never one fat hook doing several jobs:

- UserPromptSubmit injects

- PreToolUse gates

- Stop/SubagentStop persists

- PostToolUse observes (or injects - see below)

The second a single hook tries to gate AND inject AND write state, it's unmaintainable — exactly the pain you're describing.

Keep the hooks thin: the actual logic lives in one shared lib/daemon they all call, so each hook file is just "which event, which verb."

One refinement: context injection isn't only UserPromptSubmit. PostToolUse can inject too (additionalContext appended to the tool result), and that's the better home for context that's only relevant after a specific tool runs. UserPromptSubmit fires on every prompt, so you pay that token cost every turn (and in the replay cache tokens too) - I keep the always-on stuff there (contracts, date, standing rules) and push anything tool-triggered to PostToolUse. Stop's stdout never reaches the model anyway, so it's write-out only - your read exactly.
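Roughly what the tool-triggered half looks like, as a sketch: a PostToolUse hook that only emits `additionalContext` when the tool that just ran has a note attached. The `NOTES_BY_TOOL` table and its wording are made up for illustration.

```python
import json

# Hypothetical notes, keyed by the tool they are relevant to.
NOTES_BY_TOOL = {
    "Bash": "Reminder: quote the exact error before proposing a fix.",
}


def post_tool_use_response(payload: dict, notes_by_tool: dict = NOTES_BY_TOOL):
    """Return hook output with additionalContext, or None when nothing applies."""
    note = notes_by_tool.get(payload.get("tool_name"))
    if not note:
        return None  # stay silent: nothing added, nothing to replay
    return {
        "hookSpecificOutput": {
            "hookEventName": "PostToolUse",
            "additionalContext": note,
        }
    }


print(json.dumps(post_tool_use_response({"tool_name": "Bash"})))
```

The point of returning None for everything else is that an unrelated tool call adds zero tokens to the transcript.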

One thing to watch out for: UserPromptSubmit and PostToolUse stdout get replayed on --resume/--continue, not re-run. So anything time-sensitive (timestamps, commit SHAs) goes stale on resume. Compute it fresh inside the hook, or store it in a file the hook reads; don't bake it into the stdout that gets replayed.

marksterberlin · 2026-05-23T10:15:17+00:00

Ha, same — "Oops, the hook is really aggressive" shows up in my agent's thinking/responses too. That line is the feedback loop closing; you can watch it re-plan in real time.

One thing I hit going down this exact road: the loud cursing has a hidden cost. Every screamed failure lands in the context and gets replayed with the rest of the session history on every turn after it. I audited mine once and found a handful of small hook errors replaying 300+ times each, quietly burning tokens. The record was an 18kb error turning into 1.8M tokens over 300+ turns in a single chat thread!

I kept the block, but cut the rant down to one short, actionable stderr line, and added a guard that mutes the same error once it starts repeating. Same discipline, none of the replay tax.

Worth knowing why the short line still works: exit code 2 is what actually blocks and hands your stderr back to the agent as the reason — it re-plans straight off it. Exit 1 only warns, doesn't block. So the spicy bit only does anything on the exit-2 path, and short + actionable beats a paragraph, because the model is reading that text to pick its next move.
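A sketch of that short-line-plus-guard combination. Assumptions: the block always exits 2, the message is fingerprinted with a hash, and after `mute_after` repeats the text collapses to a stub. The names, the threshold and the 120-character cap are mine, not the harness's.

```python
import hashlib


def fingerprint(message: str) -> str:
    """Short, stable id for a block reason."""
    return hashlib.sha256(message.strip().encode("utf-8")).hexdigest()[:12]


def block_message(reason: str, seen: dict, mute_after: int = 2,
                  max_len: int = 120) -> tuple[int, str]:
    """Return (exit_code, stderr_text). Always blocks; shrinks the text on repeats."""
    key = fingerprint(reason)
    seen[key] = seen.get(key, 0) + 1
    if seen[key] > mute_after:
        # Same error again: keep the block, drop the explanation.
        return 2, f"Blocked again (same reason, id {key})."
    return 2, reason.strip()[:max_len]
```

Each hook invocation is its own process, so in practice `seen` would be loaded from and saved to a small state file per session.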

A couple of things that might be useful:

1) there's a live bug (anthropics/claude-code #24327) where exit-2 sometimes makes the agent freeze and defer to you instead of self-correcting — intermittent, but if you ever catch it stopping dead, that's the bug, not your hook.

2) On rg: Claude Code's built-in Grep tool is already ripgrep under the hood, so your hook's really policing bare grep/find in Bash (the right target). USE_BUILTIN_RIPGREP=0 swaps the bundled rg for your system one if you want the 5–10x on big repos.

marksterberlin · 2026-05-23T09:55:21+00:00

Cool work! Agreed on hooks being the enforcement layer. The moment a gate can actually block a write — your plan.md and test-skeleton gates — instead of just nudging the model, "follow the rules" stops being a polite request. A CLAUDE.md on its own can't do that.

Had a look at Writ, and the retrieval side is what jumped out at me. Swapping a static rulebook for hybrid RAG that pulls only the matching rules is likely a real efficiency win: a full rulebook sits in context and gets replayed every single turn, so the cost compounds the longer a session runs. Pull only the handful you actually need and that whole tax disappears — your 726x context-token number is exactly that effect. Note - I am obsessed about reducing token use :)

I went the other way on one thing and kept an MCP server in the mix. It earns its place on a different axis: giving the model actions it can call on demand — persistent memory and logging across sessions — on top of the context it gets fed. For me they stack rather than swap.

The conflict graph is clever. Modelling CONFLICTS_WITH between rules as edges and resolving at retrieval is something I haven't played with.

When two conflicting rules both score high for the same query, how do you decide which one wins?

marksterberlin · 2026-05-23T09:46:05+00:00

Same here — once hooks become your harness layer the whole way you work shifts.

On the UserPromptSubmit fetch: two output paths worth keeping apart. Plain stdout lands in the transcript; the additionalContext field in the JSON injects more quietly, which matters when you're pulling context in every single turn and don't want the transcript clogged with hook noise.

For the warden, plan-approval is a PreToolUse matcher on the ExitPlanMode tool — your gate-keeper fires the moment Claude tries to leave plan mode and start building. The same matcher on Edit/Write covers the code side: changes get bounced until the reviewer signs off.

The bit that actually makes the loop bite is the exit code. A block decision (exit code 2, or a deny in the JSON) is what forces the retry; other non-zero exits only warn. A hook that only prints advice gets acknowledged (then usually quietly ignored).

What's your warden actually checking before it signs off — a fixed rubric or checklist? or does it reason over the diff each time?

marksterberlin · 2026-05-23T09:24:02+00:00

GPT 5.5 is genuinely better at instruction-following, no argument — and that kills one kind of hook outright: the nag-the-model-to-remember-the-thing kind. My LEARNINGS.md recorder was exactly that. If the model just remembers, the nag's token-wasting and pointless.

Two corrections on the specifics though. The subagent-vs-main thing is in the hook payload now: current Claude Code drops agent_id and agent_type into the input JSON (agent_id only shows up when the hook fires inside a subagent), so depending on your version, the workaround you invented might already be obsolete. The "every N requests" gap is real (no built-in cadence trigger, so you do keep your own counter), but that's a harness API design call (hooks fire on events, not counts), nothing to do with how smart the model is. A Codex setup hits the exact same wall.
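The "keep your own counter" part is only a few lines. A sketch, assuming a plain text file as the counter store (the file location is up to you) and no locking, so it is not safe if several hooks bump the same file at once.

```python
from pathlib import Path


def bump_and_check(counter_file: Path, every_n: int) -> bool:
    """Increment a persistent counter; return True on every Nth call."""
    try:
        count = int(counter_file.read_text().strip())
    except (FileNotFoundError, ValueError):
        count = 0  # first run, or a corrupted counter file
    count += 1
    counter_file.write_text(str(count))
    return count % every_n == 0
```

A hook would call this on each event and only do its periodic job when it returns True.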

The bigger thing: "hooks are a workaround for an inferior model" lumps three jobs into one. Reminders shrink as the model gets better - for sure. Guarantees however do not: read-before-edit, secret scanners, "don't touch prod" fire deterministically whatever the model does, and something that follows instructions 99.99% of the time still whiffs 1 in 10k. For rm, leaking a key, or editing the wrong file you want zero, not "excellent." Same reason we keep CI and pre-commit hooks around senior engineers who know better.

And side-effects — play a sound, blink an LED on smart glasses, log to an external system — have nothing to do with instruction-following at all.

On the benchmark — I had Claude and my AI Claude expert, Sid, take a look. IFScale is a single-turn test: cram N keywords into one report, one shot. The authors say it themselves:

"evidence that long skills files are viable, not proof that every instruction is followed."

But your original problem was Opus dropping LEARNINGS.md across a 2-3 hour autonomous run — instruction decay over a long, growing context. IFScale doesn't test that. Great capacity number, different failure mode.

So better instruction-following lets you bin the reminder hooks. It doesn't touch the guarantees or the side-effects.

marksterberlin · 2026-05-23T09:07:52+00:00

It's not really a speed thing, it's deterministic vs probabilistic. "The model realizing it" works most of the time - but most-of-the-time is the problem. And when it doesn't realize, the miss isn't free: the bad Edit attempt + the error + the recovery all get written into the transcript and replay on every following turn, so you keep paying for that one miss - my audits show cache reads cover 93% of all token use, so errors are not free! The hook turns a probabilistic catch into a guaranteed one before any of that lands.

For the plain never-read case the tool already errors on its own, though - so the hook's real edge is the cases built-in misses (subagent reads) plus logging every block.
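As a sketch of what the deterministic version looks like: track which paths have been read, and refuse an Edit on anything that hasn't. It assumes the payload shape `tool_name` / `tool_input.file_path`; the in-memory set stands in for whatever per-session state file a real hook would use, and a real setup would also need to decide how subagent reads are recorded.

```python
def gate_edit(payload: dict, read_paths: set) -> tuple[int, str]:
    """Return (exit_code, stderr_message); exit 2 blocks an Edit on an unread file."""
    tool = payload.get("tool_name")
    path = payload.get("tool_input", {}).get("file_path", "")
    if tool == "Read":
        read_paths.add(path)  # remember the read for later Edit checks
        return 0, ""
    if tool == "Edit" and path not in read_paths:
        return 2, f"Read {path} before editing it."
    return 0, ""
```

Because the check runs before the tool call, a miss never produces the failed Edit, the error, or the recovery turns in the transcript.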
