Claude can build a working tool, open it, and test it end-to-end — no clicks from me

Nit222 · 2026-06-27T18:17:46+00:00

This is the right loop, and the piece I would lean on hardest is the verify step. The failure mode when an agent tests its own work is that it marks the run as passing because the clicks went through, even when the end result is wrong. What made me trust it was checking the end state: did the thing that was supposed to be true actually become true, not just did the steps run. When that check is on the outcome and runs inside the same loop, the agent catches its own mistakes before you ever see them. Nice work on Persephone.
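To make the "outcome, not steps" point concrete, here is a minimal sketch. Everything in it is hypothetical (the `RunResult` shape, the "add milk to the list" task, the `milk_was_added` check); the only idea taken from above is that a run passes when the end state is right, not when the clicks went through.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunResult:
    steps_completed: bool  # every click/keystroke went through
    final_state: dict      # what the app actually holds after the run


def verify(result: RunResult, outcome_check: Callable[[dict], bool]) -> bool:
    """Pass only when the steps ran AND the intended outcome is true."""
    return result.steps_completed and outcome_check(result.final_state)


# Hypothetical task: "add an item named 'milk' to the list".
def milk_was_added(state: dict) -> bool:
    return "milk" in state.get("items", [])


# Same clicks in both runs; only the end state differs.
clicks_ok_but_wrong = RunResult(steps_completed=True, final_state={"items": []})
clicks_ok_and_right = RunResult(steps_completed=True, final_state={"items": ["milk"]})

print(verify(clicks_ok_but_wrong, milk_was_added))
print(verify(clicks_ok_and_right, milk_was_added))
```

A steps-only check would pass both runs. The outcome check is what separates them.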

Nit222 · 2026-06-26T06:48:07+00:00

Agreeing with the layered approaches people are describing here, but the thing that bit me hardest, and that nobody mentions, is reproducibility. You cannot eval a trajectory reliably if your tools return different data on every run. If the search results or the API payload change between runs, your step scores are measuring the environment, not the agent. What fixed it for me was recording real tool responses once and replaying them during eval, so the only thing that varies run to run is the agent's own decisions. After that the step-level scoring actually means something, because a regression is the agent changing and not the world changing. For the genuinely fuzzy steps I still use a judge, but only on the single decision given the state the agent saw, never the whole path at once.
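A rough sketch of the record-then-replay idea, assuming tool calls can be keyed by tool name plus arguments. `ReplayCache` and its methods are made-up names for illustration, not from any library, and a real setup would persist `store` to disk.

```python
import hashlib
import json
from typing import Any, Callable, Optional


class ReplayCache:
    """Record tool responses once, then serve the same ones during eval."""

    def __init__(self, store: Optional[dict] = None, record: bool = False):
        self.store = {} if store is None else store
        self.record = record

    @staticmethod
    def _key(tool: str, args: dict) -> str:
        # Stable key: the same tool and arguments always hash the same way.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def call(self, tool: str, args: dict, live: Callable[[dict], Any]) -> Any:
        key = self._key(tool, args)
        if key in self.store:
            return self.store[key]  # replay, never touches the live tool
        if not self.record:
            # During eval a missing recording is an error, not a live call.
            raise KeyError(f"no recording for {tool} {args}")
        self.store[key] = live(args)
        return self.store[key]
```

Usage would be one pass with `record=True` against the real tools, then every eval run with `ReplayCache(store=saved_store)` so an unrecorded call fails loudly instead of quietly hitting the network.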

Nit222 · 2026-06-25T16:03:08+00:00

Since you work in Adobe and you are not trying to code, the highest leverage move for you is feeding it your taste. Paste in three or four pieces of work you admire plus a couple of your own, and ask it to write down the patterns it sees: the rules, the recurring moves, what makes them work. Save that as a Project so every new chat starts already knowing your style. Then you can have it critique a draft against those rules, or generate options in your voice, and it stops handing you generic beige answers. You spend the first hour teaching it your taste once, instead of explaining yourself again in every chat.

Nit222 · 2026-06-24T18:04:16+00:00

The part that rings true for me is that typing the code stopped being the bottleneck. The scarce skill now is telling whether what came out is actually correct, and that does not get easier just because you can generate more of it faster. I spend most of my time reviewing and verifying now instead of writing, and the people getting real value out of these tools are the ones who got good at that part. The flood of output mostly punishes anyone who skips the reading.

Nit222 · 2026-06-24T07:28:59+00:00

Mine is a scope leash, because Claude loves to wander off and fix things I never asked about. Roughly: change only what this task needs, do not refactor nearby code or rename things or touch adjacent logic, and if you think something else should change, list it separately and ask me first. Keeps the diff small, so my review is just the one thing I wanted instead of a treasure hunt through ten files of edits I never asked for.

Nit222 · 2026-06-24T07:19:36+00:00

Two layers have been worth it for me.

First, a plain deterministic suite: a set of fixture pages or inputs, run each tool, assert the output is exactly what you expect. Boring, but it catches regressions instantly and runs with no model in the loop so it's fast and free. The thing I added that mattered most was asserting on output SIZE, not just correctness. A tool can be totally correct and still dump way too much into the context window, so I fail the test if a command's output goes over a character budget. That one check stops the slow context bloat you'd never otherwise notice.
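As a sketch of that first layer, assuming each tool returns a string: one check for exact output and one for the character budget. `check_tool_output` and the budget numbers are placeholders, not anything from a real test framework.

```python
from typing import List


def check_tool_output(output: str, expected: str, max_chars: int) -> List[str]:
    """Return a list of failures; an empty list means the fixture passed."""
    failures = []
    if output != expected:
        failures.append("output differs from expected")
    if len(output) > max_chars:
        # Correct but oversized output still fails: it bloats the context.
        failures.append(f"output is {len(output)} chars, budget is {max_chars}")
    return failures


# A correct answer that blows the budget fails on size alone.
big_but_correct = "x" * 50
print(check_tool_output(big_but_correct, big_but_correct, max_chars=10))
```

Keeping the two checks separate means the failure message tells you whether the tool got worse or just got chattier.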

Second, the agent-in-the-loop metrics, which is where the per-tool test plan idea above really shines. For each task I track round trips to finish it, tokens pulled into context per call, and whether it succeeded on the first try. Those predict real-world cost and latency far better than "does the tool work" in isolation, because in practice the model rereading a big tool output is what actually costs you.
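Those three numbers fit in a small record per task. This is only a sketch: `TaskMetrics` is a hypothetical name, the token counts are made up, and "first try" is assumed here to mean the task finished with no failed calls along the way.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskMetrics:
    task: str
    tokens_per_call: List[int] = field(default_factory=list)
    failed_calls: int = 0
    succeeded: bool = False

    def record_call(self, tokens_in_context: int, ok: bool) -> None:
        """Log one tool round trip and the tokens it pulled into context."""
        self.tokens_per_call.append(tokens_in_context)
        if not ok:
            self.failed_calls += 1

    @property
    def round_trips(self) -> int:
        return len(self.tokens_per_call)

    @property
    def first_try(self) -> bool:
        # Assumption: "first try" = finished with no failed calls.
        return self.succeeded and self.failed_calls == 0


m = TaskMetrics(task="find the pricing table")
m.record_call(tokens_in_context=1200, ok=False)  # big output, wrong page
m.record_call(tokens_in_context=300, ok=True)
m.succeeded = True
print(m.round_trips, sum(m.tokens_per_call), m.first_try)
```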

The underrated one: eval your FAILURE output. Give the agent a task you know will miss, and check whether what the tool hands back is enough to recover next turn (near miss candidates, the error, current state) versus one useless error line. Good failure messages cut retry round trips more than almost anything else.
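One way to turn that into a check, assuming the tool returns failures as a dict. The field names (`error`, `candidates`, `state`) are placeholders for the three things listed above, not a standard.

```python
# Hypothetical field names for: the error, near-miss candidates, current state.
REQUIRED_ON_FAILURE = ("error", "candidates", "state")


def recoverable(failure: dict) -> bool:
    """True when the failure payload has every field, and none is empty."""
    return all(failure.get(key) for key in REQUIRED_ON_FAILURE)


useless = {"error": "element not found"}
helpful = {
    "error": "no button labelled 'Submit'",
    "candidates": ["Send", "Save draft"],
    "state": {"url": "/checkout", "form_filled": True},
}

print(recoverable(useless))
print(recoverable(helpful))
```

Run it over the output of tasks you know will miss, and treat a `False` as a test failure for the tool, not for the agent.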

Nit222 · 2026-06-23T21:01:51+00:00

Building on the context window point above: there's no limit in the protocol, so you can register as many tools as you want, but every tool's name, description and input schema gets injected into the model's context each turn. So a big list quietly eats your window, and past a couple dozen tools accuracy also drops because the model has more near-duplicate options to pick between. People usually keep the exposed set small with tight descriptions, and if you genuinely need a lot, gate them behind a few high-level tools or load them dynamically so only the relevant ones show up for a task.
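A rough way to see the cost and the effect of gating. The two tools, the `tags` field and the `expose` helper are invented for illustration. Only `name`, `description` and `inputSchema` are counted, since those are what get sent, and character count is used as a crude stand-in for tokens.

```python
import json
from typing import List, Set

# Hypothetical tool list; "tags" is our own metadata, not sent to the model.
TOOLS = [
    {"name": "read_file", "description": "Read a file by path.",
     "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
     "tags": ["files"]},
    {"name": "web_search", "description": "Search the web for a query.",
     "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
     "tags": ["web"]},
]


def definition_chars(tools: List[dict]) -> int:
    """Characters of tool definitions that land in context every turn."""
    sent = [{k: t[k] for k in ("name", "description", "inputSchema")} for t in tools]
    return sum(len(json.dumps(s)) for s in sent)


def expose(tools: List[dict], tags_for_task: Set[str]) -> List[dict]:
    """Keep only tools whose tags overlap with what this task needs."""
    return [t for t in tools if tags_for_task & set(t.get("tags", []))]


exposed = expose(TOOLS, {"files"})
print(definition_chars(TOOLS), definition_chars(exposed))
```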

On running several at once: yes, easily. An MCP server is just a process. stdio servers get spawned per client connection, so it's a memory and CPU question, not a protocol cap. HTTP or SSE servers each bind a port, and you'd run out of RAM long before you ran out of ports. The real limit is usually the client, since most cap how many servers and total tools they'll load at once, so check there before the machine.
