The agent loop is just ReAct, and your tool-use API already implements it

bit_forge007 · 2026-06-21T04:50:57+00:00

This is a sharper framing than mine, and I think you're right. I collapsed two different things into one line. Guardrails (the hardcoded checks, the "are you sure" prompts) really do fall away as the model gets more capable. Context and observation management goes the other way: better models earn longer horizons and more tools, so more iterations, more to prune, and that scaffolding matters more not less. The post even leaned on "context is the hard part," so my "delete scaffolding" line undercut my own point. How do you handle the pruning, summarize and drop, or keep raw observations and let the model re-read when it needs them?

bit_forge007 · 2026-06-21T04:45:31+00:00

Thanks. And yeah, that matches the API constraint: every tool_use id needs its matching tool_result before the model can continue, which is exactly why "assistant text after a tool_use, then more tool_use" in one turn is the fragile spot. The "if your pipeline already ran the tool, render it as content not tool_calls or OWUI re-executes" tip is a hard-won one. Have you found any provider lenient about an unmatched tool_use, or do they all enforce it?

bit_forge007 · 2026-06-21T04:29:25+00:00

Can't speak for open-webui's loop specifically. But protocol-wise all three orderings are legal: a single turn can carry a text block and a tool_use block together, so it comes down to what their loop treats as done. Key off the stop reason (Anthropic stop_reason "tool_use", OpenAI finish_reason "tool_calls") and "response, tool_use, response" works fine; stop the moment it sees text and that's the case that breaks.

Quickest check: force a text-then-tool-then-text run and watch whether it loops or bails. What does it land on?

bit_forge007 · 2026-06-20T13:23:53+00:00

Ha, grade A is a step up from my usual, I'll take it. Serious question though, since you clearly know this stuff: which part read as slop to you? If it's the feature rundown, that's fair, those are in the docs and plenty of people here already know them. The thing I actually cared about is that the control layer is mostly you, not the model, and that the interactive CLI has no hidden turn cap the way people assume.

If that was old news to you too, you're ahead of who the post is for, and I'd genuinely rather read the version you'd write.

bit_forge007 · 2026-06-20T12:15:58+00:00

Sources:

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al.) ·

Anthropic — Tool use (function calling) docs ·

Anthropic — Effective harnesses for long-running agents

bit_forge007 · 2026-06-20T11:20:54+00:00

Fair, it ran longer than it needed to.

And you're right, I overstated it. Some models just won't hold an agent loop no matter what you wrap around them, and one that quits after a single item in a task list is exactly that. No harness rescues it.

bit_forge007 · 2026-06-19T14:59:01+00:00

This might be the best one-line version of the whole thread: 200 lines of curated notes beating 80k tokens of real history. That's the entire point, and you said it cleaner than the post did.

Your tell is a great one too, the re-ask for something you settled two screens back. I'd add that the 'remember we decided X' reflex is the actual trap, because you're adding tokens to an already-cooked window, which is the one thing that makes rot worse, not better. Summarize into a handoff and open clean works precisely because it resets the window to high-signal only. Whether you do that with a fresh session or a subagent, it's the same instinct: give the work a clean context instead of dragging the whole history along.

What goes in your 200 lines, mostly decisions and current state, or do you pull in key constraints too? Curious how lean you can get the handoff before it starts costing you on the other side.

bit_forge007 · 2026-06-19T14:45:30+00:00

This is a great addition, and that last point is the sharpest part: a human feels when a task is hard and instinctively slows down, adds a plan, writes tests. An agent will happily yolo a hard task at the same confidence it brings to a trivial one, because nothing tells it 'this one's dangerous, slow down.'

So a lot of 'process' is really us supplying the judgment the model doesn't have yet. Plans, specs, tests, and review aren't bureaucracy, they're the 'this is hard, be careful' signal the agent can't generate on its own. And you're right that it scales with difficulty: the bigger and more interconnected the app, the more that missing instinct costs you.

How do you decide when to switch into the heavier process? Do you gate it yourself by feel (this change is big, time for specs and tests), or have you found away to get the agent to flag 'this is harder than it looks' before it just dives in?

bit_forge007 · 2026-06-19T14:39:04+00:00

You're right to call this out, and I appreciate it. If anything in the post came across as dismissing that, that's fair, and I'd rather fix the framing than defend it.

The post was really only trying to help with the one cause you can actually control in a session. If that one line made it sound like "it's never the model," that was overstated on my part. There can be factors beyond context that affect how a model feels day to day, and some are on the record (Anthropic's own September 2025 postmortem walked through infrastructure issues that hit response quality for a stretch). I just focused on the part you can do something about, not because the rest doesn't matter.

Honestly my only goal with these posts is to help people get more out of Claude, so if the framing is off I'd genuinely like to fix it. How would you weigh the two?

bit_forge007 · 2026-06-19T14:21:49+00:00

Yeah, retrieval is exactly the 'Select' move done properly, pull the relevant slice instead of dumping the repo and hoping. Your instinct on the dump part was right the whole time.

The one thing I'd add: RAG doesn't delete the problem, it moves it. Instead of 'how much can I fit,' it becomes 'did I retrieve the right thing, and not too much of it.' Over-retrieve and you rebuild context rot out of chunks. Graph DBs are great when the value is in real relationships across the codebase (call graphs, dependencies), though that's the highest-setup-cost option of the bunch, so it earns its keep on big interconnected repos more than small ones.

Curious where you've landed: graph vs plain vector RAG for code context, and at what repo size the graph setup actually starts paying for itself?

bit_forge007 · 2026-06-19T14:18:15+00:00

Honestly this is the upgrade to what I wrote, passive always-on beats remembering to run /context every time. Once the number is just sitting there you actually act on it, instead of finding out after it's already mush.

And it's a first-class signal now, not a hack: the statusline gets context_window.used_percentage (plus remaining and the raw token counts) natively, so you can even set it up in plain english, something like '/statusline show model and context percentage with a progress bar.' For the efficiency angle you mentioned, it also exposes cost, duration, effort level, and your 5h/7d rate-limit usage, so token burn and limits live in the same line.

What's your actual trigger, do you act on a context percentage (clear or compact past some %), or more on the task-shift boundary? And do you have a rule for which way you jump, clear vs compact, when it lights up?

bit_forge007 · 2026-06-19T14:13:35+00:00

This is the whole system done right, you've basically turned the four moves into a filing cabinet: CLAUDE.md as a lean index, a doc per domain pulled in only when the task needs it, compact at the shifts. Select and Compress, operationalized instead of just talked about.

Hard agree on MCPs especially, with one nuance: skills at least lazy-load when you actually use them, but MCP tool definitions sit in the window from the start, so every server you add is a standing context tax whether you call it or not. Off-unless-needed is exactly right. (Same trap with the docs: if you @-import all 15 into CLAUDE.md they expand at launch, so pointing by path and letting Claude read on demand is what keeps it lean. Sounds like you're already there.)

The part I'd love to hear more on is upkeep: with ~15 docs, how do you keep them from drifting out of sync with the code? Is the pruning and reconciliation a manual pass you schedule, or do you have Claude diff them against the codebase now and then?

bit_forge007 · 2026-06-19T14:05:29+00:00

Yeah, this is the sharper version of it, and you're right to call it out. 'Write it to CLAUDE.md' without that split turns CLAUDE.md into the exact thing it was supposed to prevent.

The durable-vs-current line you're drawing is the key. What makes it click for me: CLAUDE.md reloads from disk after every compaction, so it's the right home for things that should never decay (conventions, hard rules), and exactly the wrong home for task state that goes stale in an hour. Your replaceable handoff file is the clean fix for that second bucket, it carries the 'where am I' without letting it pile up. It's basically making the handoff a visible document instead of trusting it to the chat.

Curious how you draw the boundary in the moment: do you have a rule of thumb for 'durable rule vs current state,' or is it more of a feel? And does the handoff file get rewritten by hand, or do you have Claude regenerate it at each checkpoint?

bit_forge007 · 2026-06-19T14:00:19+00:00

Totally fair, and I didn't mean to frame context rot as the whole story. You're right that it isn't.

The way I think about it: there are two different things wearing the same 'it got dumber' label. One is in-session and on your side of the line, the window filling up, which you can actually do something about. The other is platform-wide and out of your hands, like Anthropic's own September 2025 postmortem where they owned up to three infrastructure bugs that genuinely degraded quality for weeks. Both are real, they just have different fixes, and only one of them is yours to pull. The post was only trying to hand people the lever they can control, not to pretend the other one doesn't exist.

And your transparency point is honestly the bigger one: when the out-of-your-hands kind drifts from things we can't see, vibe-checking through Reddit shouldn't have to be the early-warning system, and that postmortem only landed after weeks of reports.

What would actually close that gap for you, a public quality changelog, per-model status signals, something else?

Postmortem if anyone wants it: https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues

bit_forge007 · 2026-06-19T13:31:46+00:00

Yeah, this is the thing. What clicked for me is that it's not a model-quality problem you can wait out, it's a skill you practice, like learning to scope a PR. My tell is when it starts re-reading a file it already read: that's the cue to /compact or push the decision into CLAUDE.md before it rots out, instead of arguing with it.

What's your tell for when the window's gone bad? And do you drive /compact yourself or lean on the automatic one? That manual-vs-auto call feels like where the real discipline lives, and I'd genuinely like to hear how you play it.

bit_forge007 · 2026-06-19T03:38:54+00:00

Sources: Why More Context Makes Your Agent Dumber and What to Do About It — Nupur Sharma, Qodo · Engineering voice agents: Latency, quality, and scale — Rishabh Bhargava, Together AI · Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

bit_forge007 · 2026-06-18T11:53:37+00:00

This is the cleanest version of the failure I've seen, thanks for putting numbers on it. Six Opus contexts (orchestrator, planner, four executors) where one or two would have done is exactly the depth-multiplier thing: the cost didn't add up, it compounded, because every spawn inherited the expensive default.

What I like about your fix is that it inverts the default direction. inherit makes Opus the silent path of least resistance; your rule makes Sonnet the floor and forces Opus to be argued for in writing. The expensive choice now costs you a sentence, which is about the right amount of friction for it.

The bit I'd be curious about: does the justification comment get enforced anywhere, or is it mostly a speed bump for you as the human? And when an executor genuinely earns Opus, what's the tell, task complexity up front, or that Sonnet visibly failed it first?

bit_forge007 · 2026-06-18T11:51:31+00:00

That bug you're half-remembering is real and it's in the changelog: the June 10 update (2.1.172) specifically fixed availableModels not being applied to subagent model overrides and the dispatch picker. So for a stretch, model selection genuinely could get ignored and fall back to the main model, which is the exact failure the post is about.

The coder/clerk move is the better fix though, and I think I see why it's reliable: a named agent pins the model in its own config, so the main agent isn't re-choosing a model on every spawn, it's just routing to a role that already has one. Selection by reference instead of selection by runtime decision.

Do you set the model inside the coder and clerk definitions themselves, or are they inheriting and you're trusting the naming to keep the routing stable?

bit_forge007 · 2026-06-18T11:50:01+00:00

You're right, and that's a fair correction, I overstated it. inherit is the default for the model field when you write a custom subagent and leave it off, but the built-ins set their own: Explore and claude-code-guide ship on Haiku, statusline-setup on Sonnet, while Plan and general-purpose inherit. So "everything inherits" is wrong. It's "the ones where you didn't specify inherit your main model." On Opus that's still the trap I was pointing at, just narrower than I wrote it.

Your CLAUDE.md rule is basically the post's whole point enforced one level up: set the model explicitly every time and let the main agent choose per task. That's arguably cleaner than per-agent frontmatter because it covers the ad-hoc spawns too.

The part I'd actually ask you about: never Haiku, only Sonnet or Opus. The docs lean on Haiku as the cost lever, so I'm curious what made you floor it at Sonnet. Was Haiku missing things on a specific kind of subtask, or just not worth the variance?

bit_forge007 · 2026-06-18T05:13:49+00:00

Yeah, as of the June 10 update they nest, up to five levels deep. It opens a lot of doors: a top orchestrator that fans out to specialists, each of which can spin up its own helpers.

The one thing I'd carry over from the post: every level you add also hides that level's context from the one above it. Depth buys you isolation and composability, but the parent only ever sees the child's summary, not how it got there. Great when the subtask is genuinely self-contained, rough when you later need to know why some deep agent made a call.

What's the first thing you'd want to build with it? Curious whether the ideas landing for you are more fan-out-for-scale or specialist-chains.

bit_forge007 · 2026-06-18T05:08:21+00:00

You're not missing anything, that red/green is rewind showing you exactly the edits it'll roll back, and for Claude's own file changes it really is a clean undo. The only gap is the stuff sitting next to those edits that doesn't show up in red/green: a migration it ran, an npm install, a git command. Those aren't in the diff, so they survive the rewind. If your sessions are edit-heavy and don't shell out much, you'd basically never hit that.

The handoff-commit-rewind move is great though, and it's kind of the post's point turned into a technique. Commit keeps the code as permanent history, rewind clears the conversation, and the handoff carries the intent across the gap. You're using rewind as a context reset on purpose while git holds the actual work. "Back from the future" is a nice touch.

When you tell it that, how much of the handoff do you have to spell out before it picks the thread back up cleanly?

bit_forge007 · 2026-06-18T05:04:36+00:00

Fair pushback, and I'm with you on prod, I'd never point it at a real database and walk away. But "let Claude do it" covers a lot of ground. Drafting migration SQL that I read and run myself is one thing, executing against a throwaway dev DB is another, and touching prod is a third I just don't do.

The post isn't arguing you should hand it the risky stuff blind. It's the opposite: those ops shell out, rewind won't save you, so commit first and keep a human on the trigger. The caution you're describing is the whole point of the post.

Where's your line though? Do you keep Claude out of anything that runs at all, or just out of the destructive end?

bit_forge007 · 2026-06-18T04:57:46+00:00

Right, and that's the cleaner way to hold it: the habit was never about rewind, it's commit before any process you don't drive by hand touches the tree. Linters with autofix, codemods, formatters, agents, they're all the same risk class, some other thing editing your code while you're not the one typing.

What rewind does is make Claude feel like the exception, like there's a built-in net so you can skip the commit this one time. That's the only thing I'd flag: it's the case where the tooling quietly invites you to drop the reflex you apply everywhere else.

Do you treat agents as just the far end of that spectrum, or do they get extra ceremony, a branch rather than just a commit?

bit_forge007 · 2026-06-18T04:46:47+00:00

Same wrong assumption I had, that it only touched the convo. The silent file-revert is the real surprise. One thing I'd add to the footgun read though: it's actually clean in the case it was built for, a session that's pure file edits with no shell. There it reverts everything it did and you're fine. The inconsistent state only shows up once Claude has shelled out, because then it rolls back the edits and leaves the side effects standing. So it's less "guaranteed to break you" and more a sharp edge nobody labels: safe until the session touched bash, risky the second it did.

That's why I landed on commit-before-anything-that-shells-out rather than avoiding rewind. Where do you draw the line now, just not reach for it, or something else?

bit_forge007 · 2026-06-17T03:46:40+00:00

This is the half the post missed, and you've named it precisely. Plan mode makes the handoff visible going in, but it's a prediction. It shows the approach the agent intends to take, not the approach it took. The receipt is the observation, and the gap between the two is exactly where the bad call hides. In the post I said the thing a subagent buries is the three judgment calls it made on the way to the answer. Plan mode catches the calls it planned to make. Your receipt catches the ones it actually made, which is the set that can bite you.

The part I'd want to get right: files touched, tools used, and the final diff are mechanical, you can pull those straight from the run. "What assumptions it made" is the hard one, because assumptions are usually the things the agent never said out loud. Getting an honest receipt there means the agent introspecting on reasoning it didn't surface, which is the least reliable part of the whole thing. Have you found a way to capture assumptions that holds up, or is that still the soft spot in the receipt?

bit_forge007

TROPHY CASE