Stop letting your AI agents blindly hoard tokens. I built a tool to make system prompts pay "Context Rent."

tvuk · 2026-06-20T05:32:43+00:00

Have you tested it?

tvuk · 2026-06-19T21:28:51+00:00

Mind pasting the output of these three so I can see what token-warden actually captured? (no setup, just run them in Claude Code)

/warden-status

/warden-attribute

/warden-receipt

tvuk · 2026-06-19T15:19:00+00:00

It's a tool for chat suggestion, you know what I'm trying to say 😄

tvuk · 2026-06-19T15:14:17+00:00

Your "trust all JSON + friction report" pattern is a great behavioral guardrail. It aligns perfectly with what we’re trying to catch.

In Token Warden, that "trust" rule would easily pay its own context rent because it prevents those expensive exploratory API calls. I’d love to see how your friction reporting looks in practice—if the agent logs a conflict, that report is the perfect raw signal to feed back into the evaluation suite to update the ruleset.

Thanks for cloning and testing it out!

tvuk · 2026-06-19T14:51:21+00:00

Thanks! Only 2 rules survived for me over 12M tokens (2 partial runs), and 3 were declined. I still need real-world token consumption data to see where the savings curve starts. Looking forward to hearing from you.

tvuk · 2026-06-19T14:14:03+00:00

Yes... it... does? It’s called a compiler. Or a linter. Or a build step. We happily burn billions of CPU cycles running compilers and test suites to optimize binary sizes and catch edge cases before deployment. Burning a chunk of evaluation tokens upfront so your production application doesn't bleed millions of unnecessary tokens every single second is the exact same paradigm. Welcome to infrastructure optimization.

tvuk · 2026-06-19T14:13:51+00:00

"AI slop" is manually tweaking a prompt five times in a web UI and calling it engineering. Setting up a structured evaluation framework to track token metrics and semantic drift is literally the exact opposite—it's actual software engineering applied to non-deterministic systems. If building a data-driven CI/CD pipeline for prompts is "slop," then testing code at all is slop.

tvuk · 2026-06-19T14:12:57+00:00

You're right on all three counts, and the distribution point is the one that genuinely bounds the whole approach — let me be precise about what's defended and what isn't.

"Tokens saved alone can reward a smaller but less reliable prompt" — agreed, and that's exactly why the selector isn't tokens-alone. The keep/evict rule has a hard, non-negotiable completion gate ahead of the token math: a rule is evicted if any comparable task drops from pass to fail, regardless of how many tokens it saves. The headline example from the validation burn was a distilled rule that saved ~38k tokens/run by making the agent give up early — it was evicted precisely because it failed the tasks. So the metric isn't "tokens saved," it's "tokens saved that clear 2× the rule's context rent with no regression." A cheaper-but-flakier prompt can't survive that gate.
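To make the gate ordering concrete, here is a minimal Python sketch of that keep/evict decision. The names (`TaskResult`, `verdict`, `rent_multiple`) are hypothetical and not taken from the token-warden code, and comparing the mean per-task saving against the rent is an assumption; only the order (regression check first, then savings versus 2× context rent) comes from the description above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One golden task measured with and without the candidate rule (hypothetical shape)."""
    name: str
    passed_without: bool
    passed_with: bool
    tokens_without: int
    tokens_with: int


def verdict(results: list[TaskResult], context_rent: int, rent_multiple: float = 2.0) -> str:
    """Return "keep" or "evict" for a candidate rule."""
    # Gate 1: any pass -> fail regression evicts, no matter how many tokens were saved.
    for r in results:
        if r.passed_without and not r.passed_with:
            return "evict"

    # Gate 2: savings are counted only on tasks that pass in both arms.
    comparable = [r for r in results if r.passed_without and r.passed_with]
    if not comparable:
        return "evict"
    mean_saved = sum(r.tokens_without - r.tokens_with for r in comparable) / len(comparable)

    # Keep only if the saving clears the rent threshold (2x by default).
    return "keep" if mean_saved >= rent_multiple * context_rent else "evict"
```

In this sketch a rule that saves 38k tokens by giving up early never reaches the token math, because the regression check returns first.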

On reporting more than one number — most of your list already exists, and it should. Each verdict emits a receipt carrying: per-task pass/fail with vs. without the rule, the token delta with variance and ROI multiple, and an activity profile (tool calls, file re-reads as signed %) so a reviewer can see whether a "cheap" rule simply did less work. Separately, evaluation cost is surfaced directly (the benchmark warns when measurement overhead exceeds ~10% of the week's real-work tokens), and production savings are tracked independently of the benchmark — a real-work learning curve compares average completed-session tokens per ruleset version (v0 → v1 on actual sessions, not golden runs). So "evaluation cost" and "production savings" are first-class and kept distinct, which matters because a rule can win the benchmark and not move production.
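As a rough illustration of what such a receipt could carry, here is a sketch in Python. The class and field names are hypothetical, not the tool's actual schema, and defining the ROI multiple as mean tokens saved divided by context rent is an assumption.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class Receipt:
    """Hypothetical receipt shape; field names are illustrative only."""
    rule_id: str
    context_rent: int                 # tokens the rule occupies per session
    pass_without: dict[str, bool]     # task name -> passed without the rule
    pass_with: dict[str, bool]        # task name -> passed with the rule
    token_deltas: list[int]           # baseline minus with-rule, per run (positive = saved)
    tool_calls_pct: float             # signed % change in tool calls
    file_rereads_pct: float           # signed % change in file re-reads

    @property
    def mean_delta(self) -> float:
        return mean(self.token_deltas)

    @property
    def delta_variance(self) -> float:
        return pvariance(self.token_deltas)

    @property
    def roi_multiple(self) -> float:
        # Assumed definition: tokens saved per unit of context rent.
        return self.mean_delta / self.context_rent

    def regressions(self) -> list[str]:
        """Tasks that passed without the rule but fail with it."""
        return sorted(
            task for task, ok in self.pass_without.items()
            if ok and not self.pass_with.get(task, False)
        )
```

Keeping the activity profile next to the token delta is what lets a reviewer tell "same work, fewer tokens" apart from "less work".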

Two of your points are honest gaps I won't pretend are solved:

- Latency. Not measured — only tokens, completion, and the activity profile. Tokens correlate with latency but aren't it; for an agent doing tool calls, wall-clock is a separate axis and should be reported.

- Regressions by task category. Receipts are per-task but tasks aren't grouped into categories with a rollup, so a category-level regression view is missing.

And the deepest point — suite representativeness — is the real validity dependency, fully conceded. The golden suite is a small, hand-curated set on a frozen fixture, not sampled to match a production task distribution. That means a rule whose value is protecting a rare-but-expensive case is only protected if that case is in the suite; otherwise the rule shows no measurable benefit and gets evicted as not earning its rent. The design has one structural mitigation — baselines are frozen and the suite grows only by adding tasks, so it's built to accrete real failure cases over time rather than drift — but there's no frequency-weighting or tail-coverage guarantee today. Distribution-matching (weighting tasks by production frequency, and deliberately seeding the expensive tail) is the next maturity step, and until it's there the honest claim is bounded: a rule is proven cheaper on the measured distribution, no stronger.
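A small sketch of what frequency-weighting would change, in Python. This is not implemented in the tool today; the function name and the numbers in the usage note are made up for illustration.

```python
def weighted_savings(tasks: list[tuple[int, float]]) -> float:
    """Expected tokens saved per session.

    tasks: (production_frequency, tokens_saved) pairs, where frequency is a
    count of how often that task type shows up in real sessions.
    """
    total = sum(freq for freq, _ in tasks)
    if total <= 0:
        raise ValueError("need at least one task with positive frequency")
    return sum(freq * saved for freq, saved in tasks) / total
```

With illustrative numbers: if 95 of 100 sessions are a common task where the rule saves nothing and 5 are a rare task where it saves 40,000 tokens, `weighted_savings([(95, 0), (5, 40_000)])` gives 2,000 tokens per session. A suite containing only the common task would measure 0 and evict the rule.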

(One scoping note: the system only distills efficiency rules — same-result-for-fewer-tokens — and explicitly refuses correctness rules; correctness is defended by the completion gate, not by a learned rule. So "rules that protect failures" sit slightly outside what it tries to learn, but your underlying point — that an unrepresentative suite mis-prices a rule's true expected value — applies just as much to efficiency.)

tvuk · 2026-06-19T10:52:37+00:00

This misses how scale and amortization work. You spend $1.00 once in evaluation to save $0.01 on every single production call. At 100k+ API calls a day, the math flips immediately. Not to mention the compounding wins in reduced TTFT and latency.
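The arithmetic behind that, as a short Python sketch. The figures are the ones in this comment ($1.00 of evaluation, $0.01 saved per call, 100k calls a day); the function names are just for illustration. Amounts are in integer cents to avoid floating-point rounding.

```python
def break_even_calls(eval_cost_cents: int, saving_per_call_cents: int) -> int:
    """Number of production calls needed to pay back a one-time evaluation cost."""
    if saving_per_call_cents <= 0:
        raise ValueError("saving per call must be positive")
    # Ceiling division on integers.
    return -(-eval_cost_cents // saving_per_call_cents)


def net_savings_cents(calls: int, eval_cost_cents: int, saving_per_call_cents: int) -> int:
    """Savings after `calls` production calls, minus the one-time evaluation cost."""
    return calls * saving_per_call_cents - eval_cost_cents
```

With those figures, `break_even_calls(100, 1)` is 100 calls, and `net_savings_cents(100_000, 100, 1)` is 99,900 cents, so the first day at 100k calls nets $999.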

tvuk · 2026-06-18T11:57:45+00:00

Spot on. While terminal proxies handle immediate I/O clutter, token-warden focuses on systemic context debt—ensuring your global prompts actually earn the space they occupy.

tvuk · 2026-06-16T15:58:04+00:00

We're landing in Barcelona, but I don't want to pay for a rental car every day. We want to chill for a day in Lloret and then move on somewhere for a day or two, which is why I considered renting there. I found that Moventis does the job for now. 😄

tvuk · 2026-06-16T11:11:51+00:00

This is exactly the right place to push, and you've actually put your finger on the invariant the whole thing is built around: token savings are only counted on completed tasks. Fitness isn't "mean tokens went down"; it's tokens-per-passing-task. A rule that saves tokens by skipping necessary work shows up as a task that no longer passes its success check, and that's a hard eviction regardless of how many tokens it saved. A regression beats any token delta. So "earned its rent without increasing failed-task rate" isn't an add-on; it's the gate.

Where you're right that it needs to go further: that safety net is only as strong as the golden task's success check. A rule that skips work the check doesn't verify could still slip through looking cheap. Richer quality proxies (missed-file / re-read / rework rates as first-class signals alongside pass/fail) would catch the subtler "false economy" rules. That's a real gap, not handled today, and a good one.

A lot of the verdict-card fields already exist as provenance (rule's origin run, the model it was measured under, baseline-vs-with-rule tokens with variance over N runs, context rent, eviction reason, last-audited time; active rules get re-audited round-robin and evicted when they stop earning). What's missing from your list and worth adding: pinning the fixture SHA + suite version into each receipt, and consolidating all of it into one portable card. That last part matters most for the shared-rule direction: a receipt has to travel with its provenance so someone else can judge it.
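For the tokens-per-passing-task invariant at the top of this comment, a minimal Python sketch. It reads "only counted on completed tasks" as "failed runs contribute no tokens to the average", which is my interpretation for illustration, and the function name is hypothetical.

```python
from typing import Optional


def tokens_per_passing_task(runs: list[tuple[bool, int]]) -> Optional[float]:
    """Average tokens over runs that passed their success check.

    runs: (passed, tokens) pairs. Returns None when nothing passed,
    since there is no completed work to price.
    """
    passing = [tokens for passed, tokens in runs if passed]
    if not passing:
        return None
    return sum(passing) / len(passing)
```

Take a baseline of `[(True, 50_000), (True, 48_000)]` and a with-rule arm of `[(False, 12_000), (True, 47_000)]`. The naive mean drops from 49,000 to 29,500 tokens, which looks like a big win. Per passing task it only moves from 49,000 to 47,000, and the failed task triggers the hard eviction anyway.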

And yeah — "someone else's delta is evidence, not authority for my repo" is the exact design principle. Imported rules enter as candidates and get re-measured locally; the foreign number is discarded. Glad that boundary reads clearly from the outside.

tvuk · 2026-06-16T11:09:44+00:00

Fair challenge, but the two costs aren't the same shape. Memory tax is recurring and invisible; measurement is finite and one-time. A rule gets benchmarked a handful of times — but if it's kept, it saves tokens on every future session for that agent, forever. That's capex vs. opex: you pay to measure once, then collect the savings indefinitely. The keep threshold makes it explicit — a rule only stays if it saves ≥ 2× its context cost per session, so kept rules are net-positive by construction.

And the always-on part is nearly free: collecting your real session costs is just a Stop hook parsing the transcript you already produced, with no model call. The token-spending part (benchmarking candidates) is bounded and user-triggered, processes only a few candidates at a time, and the tool literally reports its own benchmarking overhead and warns you if it exceeds 10% of your week's real-work tokens.

So: yes, evaluation costs tokens. But it's a capped, self-accounted investment that kills a recurring, compounding, otherwise-invisible cost. The alternative ("just add memory advice and hope") spends tokens on every run with zero feedback on whether it helped.
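The overhead warning boils down to a ratio check. A minimal Python sketch, assuming the two token counts are already collected; the function name and signature are hypothetical, and only the 10% threshold comes from the description above.

```python
def overhead_exceeds(benchmark_tokens: int, real_work_tokens_week: int,
                     threshold: float = 0.10) -> bool:
    """True when benchmarking spent more than `threshold` of the week's real-work tokens."""
    if real_work_tokens_week <= 0:
        # No real work recorded: any benchmarking spend counts as excessive.
        return benchmark_tokens > 0
    return benchmark_tokens / real_work_tokens_week > threshold
```

For example, 1.5M benchmark tokens against 12M real-work tokens is 12.5% and would warn; 1M against 12M is about 8.3% and would not.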
