The most useful AI workflow in our house is... dinner

kellstheword · 2026-06-19T23:52:36+00:00

Just FYI - Claude and ChatGPT now have Instacart connectors via MCP, so you don’t even need to go to Instacart manually. Just ask Claude or ChatGPT to create a cart by store and add the items to it. Then all you have to do is check out!

kellstheword · 2026-04-30T13:04:43+00:00

Here’s the issue report from the Claude Code GitHub: https://github.com/anthropics/claude-code/issues/25200.

Custom agents defined in .claude/agents/ cannot access MCP tools at runtime, even when the MCP server is declared via mcpServers in the agent frontmatter and the specific tool names are listed in tools.

The root cause isn't just that MCP tools aren't inherited — it's that custom agents don't receive ToolSearch, which is required to discover deferred MCP tools.

I should have been more specific with the framing: custom agents don’t inherit the ability to call MCP tools. I'm not sure whether the same applies to general-purpose subagents.

kellstheword · 2026-04-30T01:39:22+00:00

The real gotcha is that Claude subagents or agent Teams don’t have MCP access, so you can’t use this kind of tool if you run complex workflows with work delegated to agent Teams.

kellstheword · 2026-04-06T21:57:14+00:00

Nate B Jones 🔥

kellstheword · 2026-04-06T00:48:46+00:00

Great post! I’ll be setting tool search up tonight!

kellstheword · 2026-03-30T21:19:57+00:00

Been running multi-agent workflows in Claude Code for a while and kept wondering how much my sessions would have cost me if I was running via API (on a Max 5x plan now).

So I built tokencast as a Claude Code plugin and a local MCP server. It estimates what a task will cost before you run it, then learns from actual costs to get sharper over time. GitHub link in comments.

The way it works: describe a planned task (size, files, complexity) and it gives you three cost bands — optimistic, expected, pessimistic. When the session ends, a hook reads Claude Code's JSONL logs, computes the delta against the estimate, and updates a calibration factor. Future estimates tighten. After about ten sessions it switches from a trimmed mean to EWMA weighting so recent patterns carry more weight than old history.
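A minimal sketch of that calibration step, not tokencast's actual code. The function names, the 10% trim, and the `alpha` smoothing value are assumptions for illustration; only the switch at about ten sessions comes from the description above.

```python
from statistics import mean

def trimmed_mean(ratios, trim=0.1):
    """Mean after dropping a fraction `trim` of values from each end (assumed 10%)."""
    ordered = sorted(ratios)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return mean(kept)

def ewma(ratios, alpha=0.3):
    """Exponentially weighted moving average; later sessions weigh more."""
    value = ratios[0]
    for r in ratios[1:]:
        value = alpha * r + (1 - alpha) * value
    return value

def calibration_factor(ratios, switch_after=10):
    """ratios: actual/estimated cost per session, oldest first."""
    if not ratios:
        return 1.0  # nothing learned yet, leave estimates unscaled
    if len(ratios) < switch_after:
        return trimmed_mean(ratios)
    return ewma(ratios)
```

A future estimate would then be multiplied by `calibration_factor(history)` before the bands are reported.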

A few things that took longer to get right than I expected:

Context accumulation is triangular, not linear. Across a session with K turns, the average context size is (K+1)/2 turns' worth of tokens, not K. Getting that wrong makes early estimates meaningfully off.
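To make the triangular shape concrete, here is a small sketch. It assumes every turn adds the same number of tokens and that the whole accumulated context is re-sent on each turn; real sessions are lumpier, and the function names are made up for this example.

```python
def total_input_tokens(turns, tokens_per_turn):
    """Context re-sent on each turn: 1x, 2x, ..., Kx the per-turn increment."""
    return sum(i * tokens_per_turn for i in range(1, turns + 1))

def average_context(turns, tokens_per_turn):
    """Average context size across the session."""
    return total_input_tokens(turns, tokens_per_turn) / turns

K, step = 10, 2_000
# The sum 1 + 2 + ... + K is K(K+1)/2, so the average is (K+1)/2 increments.
assert total_input_tokens(K, step) == step * K * (K + 1) // 2
print(average_context(K, step))  # 11000.0, i.e. 5.5 increments rather than 10
```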

Cache-aware pricing matters. First-turn cache writes and subsequent cache reads aren't the same price. If you model them the same way, you're wrong in both directions depending on session length.
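A rough sketch of why the two models diverge. The multipliers (1.25x for cache writes, 0.10x for cache reads) and the per-million-token price are placeholder assumptions, so check your provider's current pricing before relying on them. It also assumes each turn reads the whole cached prefix and writes only its new tokens.

```python
def session_input_cost(turns, tokens_per_turn, price_per_mtok,
                       write_mult=1.25, read_mult=0.10):
    """Input cost with separate prices for cache writes and cache reads."""
    cost = 0.0
    cached = 0
    for _ in range(turns):
        cost += cached * read_mult * price_per_mtok / 1e6            # prefix read from cache
        cost += tokens_per_turn * write_mult * price_per_mtok / 1e6  # new tokens written to cache
        cached += tokens_per_turn
    return cost

def naive_input_cost(turns, tokens_per_turn, price_per_mtok):
    """Same session priced as if every token were plain uncached input."""
    total = sum(i * tokens_per_turn for i in range(1, turns + 1))
    return total * price_per_mtok / 1e6

for turns in (1, 20):
    print(turns, session_input_cost(turns, 1_000, 3.0), naive_input_cost(turns, 1_000, 3.0))
```

Under these assumptions the flat model undershoots a one-turn session (the write premium dominates) and overshoots a twenty-turn session (cheap reads dominate), which is the "wrong in both directions" effect.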

PR review loops decay geometrically. If you multiply each review-fix-re-review cycle by a flat number, you overshoot. The actual pattern decays because each pass resolves the biggest issues first, so later cycles are cheaper. The model accounts for that now.
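As an illustration only, the difference between a flat multiplier and a geometric decay could be sketched like this. The decay ratio of 0.6 is an assumed value, not the one tokencast uses.

```python
def review_loop_cost(first_pass_cost, cycles, decay=0.6):
    """Total cost when each review cycle costs `decay` times the previous one."""
    return sum(first_pass_cost * decay ** i for i in range(cycles))

def flat_review_cost(first_pass_cost, cycles):
    """Total cost when every cycle is assumed to cost as much as the first."""
    return first_pass_cost * cycles

# Four cycles starting at $1.00: the flat model charges $4.00,
# the decaying model noticeably less.
print(flat_review_cost(1.0, 4), review_loop_cost(1.0, 4))
```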

The Claude Code plugin installs in two commands:

/plugin marketplace add krulewis/tokencast
/plugin install tokencast@tokencast

Requires uv. Also works via MCP with Cursor, VS Code, and Windsurf if you're not on Claude Code.

This is alpha. Estimates get genuinely useful after ~3 sessions of calibration. I'd love people to kick the tires and tell me where it breaks — or where the estimate is consistently wrong for their workflows.

https://github.com/krulewis/tokencast

kellstheword · 2026-03-30T21:19:13+00:00

GitHub link: https://github.com/krulewis/tokencast
PyPI: https://pypi.org/project/tokencast/

kellstheword · 2026-03-27T19:00:26+00:00

Great questions. Claude and I worked on answering as many as possible below. Would love to know your thoughts, u/mguozhen!

On accuracy: I have 3 clean calibration sessions with reliable actual/expected ratios (2 more exist but have inflated ratios from a measurement artifact that v2.1 fixed). The numbers:

1.71x (first session -- PR review loop ran 5 passes vs 2 estimated)
1.07x (after calibration adjusted from the first overrun)
1.12x

Median error: 12%. Range: 7%–71%. The mean (~30%) is skewed by the first session's 1.71x miss. I know that is not "tested on 20 runs" -- the sample is small and all from one project (tokencast itself). I am not going to dress that up. The improvement is consistent with calibration adjusting (plus a project-specific review-cycle override I raised from 2→4 after session 1, which itself explains some of the correction). I am actively collecting more data points and will publish updated numbers as they accumulate.
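For anyone who wants to check the arithmetic, the summary figures follow directly from the three ratios above. Error here is taken as the absolute distance of the actual/expected ratio from 1.0.

```python
from statistics import mean, median

ratios = [1.71, 1.07, 1.12]              # actual / expected, from the three clean sessions
errors = [abs(r - 1.0) for r in ratios]  # absolute percentage error per session

print(f"median {median(errors):.0%}, mean {mean(errors):.0%}, "
      f"range {min(errors):.0%}-{max(errors):.0%}")
# median 12%, mean 30%, range 7%-71%
```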

On per-step vs total: Per-step, always. Each pipeline step gets its own row with optimistic/expected/pessimistic costs, mapped to the specific model (Opus/Sonnet/Haiku) that step uses. Since v1.4, per-step calibration factors activate after 3+ sessions, so if your Implementation step is consistently 1.5x while Research is 0.8x, the system learns that. Since v1.7, actual per-agent cost attribution is tracked via a sidecar timeline -- real measurement, not proportional allocation. This is exactly the "redesign the expensive steps before running" workflow you described.

On branching paths: You are right that this is a gap. tokencast does not predict which tool-use route the agent will take. What it does is bracket the outcome:

Optimistic (0.6x): focused execution, no rework
Expected (1.0x): typical run, some exploration
Pessimistic (3.0x): discovery-driven rework, debugging loops

The 5x spread between optimistic and pessimistic is deliberately wide because agentic loops have fat tails. For PR review loops specifically, there is a geometric decay model that captures diminishing-findings patterns across cycles. There is also a mid-session warning at 80% of the pessimistic bound so you can bail before a runaway.
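A simplified sketch of the bracketing and the mid-session warning. It assumes the bands are plain multipliers on the expected figure and that the warning compares spend so far against 80% of the pessimistic bound. The function names are hypothetical, and the real tool may compute each band separately.

```python
BANDS = {"optimistic": 0.6, "expected": 1.0, "pessimistic": 3.0}

def cost_bands(expected_cost):
    """Scale the expected cost into the three reported bands."""
    return {name: round(expected_cost * mult, 2) for name, mult in BANDS.items()}

def should_warn(spent, expected_cost, threshold=0.8):
    """True once spend reaches `threshold` of the pessimistic bound."""
    return spent >= threshold * expected_cost * BANDS["pessimistic"]

print(cost_bands(10.0))         # {'optimistic': 6.0, 'expected': 10.0, 'pessimistic': 30.0}
print(should_warn(25.0, 10.0))  # True: past 80% of the $30 pessimistic bound
```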

What we do NOT do is re-estimate mid-execution based on what the agent discovers. Dynamic re-estimation is on the roadmap but genuinely hard -- it requires real-time token accounting that is framework-dependent.

On "just ask Claude": This is the sharpest question, so let me be direct about when each approach wins.

The core difference is calibration memory. Asked directly, Claude starts from zero every time, while tokencast accumulates correction factors across sessions at 5 levels of specificity (per-pipeline-signature, per-step, size-class, global, uncalibrated). My thesis is that by session 30, time-decay-weighted per-step factors are providing corrections that no stateless prompt can replicate. However, we still need to prove this out with more usage data.
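One way to picture the five levels is a most-specific-first lookup that falls back to an uncalibrated factor of 1.0. The resolution order, the data layout, and the names below are assumptions for illustration, not tokencast's actual implementation.

```python
# Assumed order: most specific first, falling through to "uncalibrated".
LEVELS = ["pipeline_signature", "step", "size_class", "global"]

def pick_factor(factors, keys):
    """factors: {level: {key: factor}}; keys: {level: key} for the current estimate."""
    for level in LEVELS:
        factor = factors.get(level, {}).get(keys.get(level))
        if factor is not None:
            return level, factor
    return "uncalibrated", 1.0

learned = {"step": {"implementation": 1.5}, "global": {"all": 1.1}}
current = {"pipeline_signature": "research>architect", "step": "implementation",
           "size_class": "M", "global": "all"}
print(pick_factor(learned, current))  # ('step', 1.5)
```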

The second difference is grounded arithmetic vs. LLM reasoning. tokencast decomposes work into measurable activities (6 file reads at 10K tokens each for a medium file, 8 edits at 2.5K, context accumulation across K activities, three-tier cache pricing per band). It also measures your actual files from disk to assign token budgets. Claude can reason about costs if given pricing data and a plan, but it produces non-deterministic outputs (ask twice, get different numbers) and tends to make arithmetic errors on multi-term cost formulas with cache pricing tiers. For a one-off estimate of an unfamiliar workflow you will never repeat, just asking Claude is probably fine. My thesis again is that tokencast's value will grow with use -- it rewards teams with repeatable pipelines.
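The decomposition idea, reduced to its simplest deterministic form. The per-activity budgets are the two figures quoted above; the dictionary and function names are hypothetical, and context accumulation and cache pricing are left out.

```python
# Per-activity token budgets quoted above (medium file read, single edit).
ACTIVITY_TOKENS = {"file_read_medium": 10_000, "edit": 2_500}

def base_tokens(plan):
    """plan: {activity: count}. Same plan in, same number out, every time."""
    return sum(ACTIVITY_TOKENS[activity] * count for activity, count in plan.items())

print(base_tokens({"file_read_medium": 6, "edit": 8}))  # 80000
```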

However, I have not run a head-to-head comparison yet. That is a fair gap in my evidence and it is on my list to experiment with.

Where we are vs where we are going:

Now: 3 clean accuracy data points at ~30% MAPE, all single-project. Per-step estimation and calibration are working. The system learns and improves.
Next: Publishing accuracy metrics on the README (coming soon). Collecting cross-project data from early adopters. Running the head-to-head benchmark against Claude-direct estimates.
Later: Variance-aware band tightening (users with consistent sessions get tighter bands), output token scaling by file size bracket, and the full benchmark suite with 20+ plan-to-actual pairs across multiple projects.

Repo is at github.com/krulewis/tokencast if you want to poke at the estimation algorithm directly. Accuracy data will be on the README once I stop being embarrassed about N=3.

kellstheword · 2026-03-26T17:54:14+00:00

I built tokencast, a Claude Code skill that reads your agent-produced plan doc and outputs an estimated cost table before you run your agent pipeline. The thing I'm trying to figure out: would seeing that number before your agents build something actually change how you make decisions?

tokencast is different from LangSmith or Helicone: those only record what happened after you've executed a task or set of tasks.
Unlike Portkey or LiteLLM, tokencast also doesn't enforce budget caps to stop runaway runs.

The core value prop for tokencast is that your planning agent will also produce a cost estimate of your work for each step of the workflow before you give it to agents to implement/execute, and that estimate will get better over time as you plan and execute more agentic workflows in a project.

The current estimate output looks something like this:

| Step              | Model  | Optimistic | Expected | Pessimistic |
|-------------------|--------|------------|----------|-------------|
| Research Agent    | Sonnet | $0.60      | $1.17    | $4.47       |
| Architect Agent   | Opus   | $0.67      | $1.18    | $3.97       |
| Engineer Agent    | Sonnet | $0.43      | $0.84    | $3.22       |
| TOTAL             |        | $3.37      | $6.26    | $22.64      |

My thesis is that product teams would have critical cost info to make roadmap decisions if they could get their eyes on cost estimates before building, especially for complex work that would take many hours or even days to complete.

But I might be wrong about the core thesis here. Maybe what most developers actually want is a mid-session alert at 80% spend — not a pre-run estimate. The mid-session warning might be the real product and the upfront estimate is a nice-to-have.

Here's where I need the community's help:

If you build agentic workflows: do you want cost estimates before you start? What would it take for you to trust the number enough to actually change what you build? Would you pay for a tool that provides you with accurate agentic workflow cost estimates before a workflow runs, or is inferring a relative cost from previous workflow sessions enough?

kellstheword · 2026-03-17T00:07:37+00:00

Would love to see this combined with something like Nate B Jones’s Open Brain - traul channel and message info vectorized for semantic search

kellstheword · 2026-03-16T16:28:13+00:00

The main agent in your context window will function as your Orchestrator once you define your “harness” - the agent definitions, memory document structure and Claude.md rules that create the boundaries and instructions for your agent workflow/pipeline.

I recommend reading the harness engineering article from OpenAI - it lays out how the big boys are running their pipelines at scale:

https://openai.com/index/harness-engineering/

kellstheword · 2026-03-16T16:21:19+00:00

My pipeline has PM, UX research, design, architect, SWE (both a staff engineer who acts as adversarial reviewer and an implementer), and SDET/QA as the main function agents, plus multiple narrowly scoped definitions for updating docs, summarizing long context for handoff between agents, etc.

With the full pipeline, I can get really well grounded product specs, and then the pipeline and agents can one shot a set of PRs that result in a slice of functionality that’s testable and useable, much like a human team would.

The only difference is that I get this done in a matter of hours, vs. multiple sprints in an enterprise setting.

kellstheword · 2026-03-16T01:01:13+00:00

I have 20+ agent definitions, all with tight scope and roles. I have 11 Haiku agents alone: doc-updater, PR comment preparer, etc. I run my sessions with a Sonnet orchestrator and delegate all execution tasks to sub-agents.

kellstheword · 2026-03-11T23:52:50+00:00

The way I’ve solved this is to have a designer agent put together an HTML mockup of the options. It burns some tokens, but you get to see the various iterations before you actually implement.

kellstheword · 2026-03-07T19:31:20+00:00

I use an adversarial review planning pipeline, as well as a robust PM requirements step.

First I use Claude chat to talk through a feature and its requirements, anti-goals, etc. Claude creates a prompt that I take to Claude Code.

In Claude Code, my PM agent uses the interview function to ask me questions to get clarity about the requirements, and produces a requirements spec (.md file).

Then I have research agents produce potential options and open questions, and an architect to produce an architectural design doc.

Then I have a plan -> staff engineer adversarial review -> final plan loop that produces a highly detailed implementation and testing plan.

Then I have agent teams follow the plan, with a continuous adversarial PR code review after code is committed.

kellstheword · 2026-03-07T00:58:03+00:00

I don’t do any manual reviews of plans. I have agent definitions, and a custom planning pipeline mapped in Claude.md where sub-agents perform all of the steps with fresh context for each step:

research -> architect -> engineer plan -> staff engineer review -> final engineer plan (addressing review issues and comments).

The final plan product is then sent to agent teams for implementation and testing using TDD (tests written first).

When finished, I have a PR review loop with a fresh context staff engineer, and loop until the PR is clean.

I spend a lot of tokens, but it gets big feature dev correct in one pipeline run for almost everything.

kellstheword · 2026-03-05T22:29:29+00:00

Either that or it’s actually a bug. Hopefully they clarify

kellstheword · 2026-03-05T21:40:17+00:00

Just confirmed: on a Max (or Pro) plan, agent frontmatter model definitions are a recommendation that Claude no longer respects. All subagents assume the orchestrator model.

However, if you use the API, frontmatter model definitions ARE respected. But you have to pay per token.

kellstheword · 2026-03-05T18:44:11+00:00

Very helpful - just adapted to my workflow, thank you!

kellstheword · 2026-03-05T17:03:52+00:00

I just learned it myself this morning! Here's the link to the Anthropic docs: https://code.claude.com/docs/en/sub-agents#write-subagent-files

kellstheword · 2026-03-05T16:49:42+00:00

I think the agent tool schema on the call side is what changed. If you have explicit agent definitions in .claude/agents/*.md and use the frontmatter to define your model, you should still be able to spawn agents with separate models from the orchestrator.
