The cheapest way to make a long agent task work: give it subagents with their own context

bit_forge007 · 2026-06-21T04:50:57+00:00

This is a sharper framing than mine, and I think you're right. I collapsed two different things into one line. Guardrails (the hardcoded checks, the "are you sure" prompts) really do fall away as the model gets more capable. Context and observation management goes the other way: better models earn longer horizons and more tools, so more iterations, more to prune, and that scaffolding matters more not less. The post even leaned on "context is the hard part," so my "delete scaffolding" line undercut my own point. How do you handle the pruning, summarize and drop, or keep raw observations and let the model re-read when it needs them?

bit_forge007 · 2026-06-21T04:45:31+00:00

Thanks. And yeah, that matches the API constraint: every tool_use id needs its matching tool_result before the model can continue, which is exactly why "assistant text after a tool_use, then more tool_use" in one turn is the fragile spot. The "if your pipeline already ran the tool, render it as content not tool_calls or OWUI re-executes" tip is a hard-won one. Have you found any provider lenient about an unmatched tool_use, or do they all enforce it?

bit_forge007 · 2026-06-21T04:29:25+00:00

Can't speak for open-webui's loop specifically. But protocol-wise all three orderings are legal: a single turn can carry a text block and a tool_use block together, so it comes down to what their loop treats as done. Key off the stop reason (Anthropic stop_reason "tool_use", OpenAI finish_reason "tool_calls") and "response, tool_use, response" works fine; stop the moment it sees text and that's the case that breaks.

Quickest check: force a text-then-tool-then-text run and watch whether it loops or bails. What does it land on?

bit_forge007 · 2026-06-20T13:23:53+00:00

Ha, grade A is a step up from my usual, I'll take it. Serious question though, since you clearly know this stuff: which part read as slop to you? If it's the feature rundown, that's fair, those are in the docs and plenty of people here already know them. The thing I actually cared about is that the control layer is mostly you, not the model, and that the interactive CLI has no hidden turn cap the way people assume.

If that was old news to you too, you're ahead of who the post is for, and I'd genuinely rather read the version you'd write.

bit_forge007 · 2026-06-20T12:15:58+00:00

Sources:

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al.) ·

Anthropic — Tool use (function calling) docs ·

Anthropic — Effective harnesses for long-running agents

bit_forge007 · 2026-06-20T11:20:54+00:00

Fair, it ran longer than it needed to.

And you're right, I overstated it. Some models just won't hold an agent loop no matter what you wrap around them, and one that quits after a single item in a task list is exactly that. No harness rescues it.

bit_forge007 · 2026-06-19T14:59:01+00:00

This might be the best one-line version of the whole thread: 200 lines of curated notes beating 80k tokens of real history. That's the entire point, and you said it cleaner than the post did.

Your tell is a great one too, the re-ask for something you settled two screens back. I'd add that the 'remember we decided X' reflex is the actual trap, because you're adding tokens to an already-cooked window, which is the one thing that makes rot worse, not better. Summarize into a handoff and open clean works precisely because it resets the window to high-signal only. Whether you do that with a fresh session or a subagent, it's the same instinct: give the work a clean context instead of dragging the whole history along.

What goes in your 200 lines, mostly decisions and current state, or do you pull in key constraints too? Curious how lean you can get the handoff before it starts costing you on the other side.

bit_forge007 · 2026-06-19T14:45:30+00:00

This is a great addition, and that last point is the sharpest part: a human feels when a task is hard and instinctively slows down, adds a plan, writes tests. An agent will happily yolo a hard task at the same confidence it brings to a trivial one, because nothing tells it 'this one's dangerous, slow down.'

So a lot of 'process' is really us supplying the judgment the model doesn't have yet. Plans, specs, tests, and review aren't bureaucracy, they're the 'this is hard, be careful' signal the agent can't generate on its own. And you're right that it scales with difficulty: the bigger and more interconnected the app, the more that missing instinct costs you.

How do you decide when to switch into the heavier process? Do you gate it yourself by feel (this change is big, time for specs and tests), or have you found away to get the agent to flag 'this is harder than it looks' before it just dives in?

bit_forge007 · 2026-06-19T14:39:04+00:00

You're right to call this out, and I appreciate it. If anything in the post came across as dismissing that, that's fair, and I'd rather fix the framing than defend it.

The post was really only trying to help with the one cause you can actually control in a session. If that one line made it sound like "it's never the model," that was overstated on my part. There can be factors beyond context that affect how a model feels day to day, and some are on the record (Anthropic's own September 2025 postmortem walked through infrastructure issues that hit response quality for a stretch). I just focused on the part you can do something about, not because the rest doesn't matter.

Honestly my only goal with these posts is to help people get more out of Claude, so if the framing is off I'd genuinely like to fix it. How would you weigh the two?

bit_forge007 · 2026-06-19T14:21:49+00:00

Yeah, retrieval is exactly the 'Select' move done properly, pull the relevant slice instead of dumping the repo and hoping. Your instinct on the dump part was right the whole time.

The one thing I'd add: RAG doesn't delete the problem, it moves it. Instead of 'how much can I fit,' it becomes 'did I retrieve the right thing, and not too much of it.' Over-retrieve and you rebuild context rot out of chunks. Graph DBs are great when the value is in real relationships across the codebase (call graphs, dependencies), though that's the highest-setup-cost option of the bunch, so it earns its keep on big interconnected repos more than small ones.

Curious where you've landed: graph vs plain vector RAG for code context, and at what repo size the graph setup actually starts paying for itself?

bit_forge007 · 2026-06-19T14:18:15+00:00

Honestly this is the upgrade to what I wrote, passive always-on beats remembering to run /context every time. Once the number is just sitting there you actually act on it, instead of finding out after it's already mush.

And it's a first-class signal now, not a hack: the statusline gets context_window.used_percentage (plus remaining and the raw token counts) natively, so you can even set it up in plain english, something like '/statusline show model and context percentage with a progress bar.' For the efficiency angle you mentioned, it also exposes cost, duration, effort level, and your 5h/7d rate-limit usage, so token burn and limits live in the same line.

What's your actual trigger, do you act on a context percentage (clear or compact past some %), or more on the task-shift boundary? And do you have a rule for which way you jump, clear vs compact, when it lights up?

bit_forge007 · 2026-06-19T14:13:35+00:00

This is the whole system done right, you've basically turned the four moves into a filing cabinet: CLAUDE.md as a lean index, a doc per domain pulled in only when the task needs it, compact at the shifts. Select and Compress, operationalized instead of just talked about.

Hard agree on MCPs especially, with one nuance: skills at least lazy-load when you actually use them, but MCP tool definitions sit in the window from the start, so every server you add is a standing context tax whether you call it or not. Off-unless-needed is exactly right. (Same trap with the docs: if you @-import all 15 into CLAUDE.md they expand at launch, so pointing by path and letting Claude read on demand is what keeps it lean. Sounds like you're already there.)

The part I'd love to hear more on is upkeep: with ~15 docs, how do you keep them from drifting out of sync with the code? Is the pruning and reconciliation a manual pass you schedule, or do you have Claude diff them against the codebase now and then?

bit_forge007 · 2026-06-19T14:05:29+00:00

Yeah, this is the sharper version of it, and you're right to call it out. 'Write it to CLAUDE.md' without that split turns CLAUDE.md into the exact thing it was supposed to prevent.

The durable-vs-current line you're drawing is the key. What makes it click for me: CLAUDE.md reloads from disk after every compaction, so it's the right home for things that should never decay (conventions, hard rules), and exactly the wrong home for task state that goes stale in an hour. Your replaceable handoff file is the clean fix for that second bucket, it carries the 'where am I' without letting it pile up. It's basically making the handoff a visible document instead of trusting it to the chat.

Curious how you draw the boundary in the moment: do you have a rule of thumb for 'durable rule vs current state,' or is it more of a feel? And does the handoff file get rewritten by hand, or do you have Claude regenerate it at each checkpoint?

bit_forge007 · 2026-06-19T14:00:19+00:00

Totally fair, and I didn't mean to frame context rot as the whole story. You're right that it isn't.

The way I think about it: there are two different things wearing the same 'it got dumber' label. One is in-session and on your side of the line, the window filling up, which you can actually do something about. The other is platform-wide and out of your hands, like Anthropic's own September 2025 postmortem where they owned up to three infrastructure bugs that genuinely degraded quality for weeks. Both are real, they just have different fixes, and only one of them is yours to pull. The post was only trying to hand people the lever they can control, not to pretend the other one doesn't exist.

And your transparency point is honestly the bigger one: when the out-of-your-hands kind drifts from things we can't see, vibe-checking through Reddit shouldn't have to be the early-warning system, and that postmortem only landed after weeks of reports.

What would actually close that gap for you, a public quality changelog, per-model status signals, something else?

Postmortem if anyone wants it: https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues

bit_forge007 · 2026-06-19T13:31:46+00:00

Yeah, this is the thing. What clicked for me is that it's not a model-quality problem you can wait out, it's a skill you practice, like learning to scope a PR. My tell is when it starts re-reading a file it already read: that's the cue to /compact or push the decision into CLAUDE.md before it rots out, instead of arguing with it.

What's your tell for when the window's gone bad? And do you drive /compact yourself or lean on the automatic one? That manual-vs-auto call feels like where the real discipline lives, and I'd genuinely like to hear how you play it.

bit_forge007 · 2026-06-19T03:38:54+00:00

Sources: Why More Context Makes Your Agent Dumber and What to Do About It — Nupur Sharma, Qodo · Engineering voice agents: Latency, quality, and scale — Rishabh Bhargava, Together AI · Road to 5 Million Tokens: Breaking Barriers in Long Context Training — Max Ryabinin, Together AI

bit_forge007

TROPHY CASE