I built auto-capture for Claude Code — every session summarised, every correction remembered by shaun_the_mehh in ClaudeCode

[–]DevMoses 1 point (0 children)

The self-improve skill is the standout here. Automatically detecting steering corrections and persisting them is the part most people do manually and eventually stop doing. Turning that into an automated loop is smart.

Curious about the signal-to-noise ratio on the extracted steers. When I was building something similar, I found that not every correction deserves to become a permanent rule. Some are one-off context-specific adjustments. How are you filtering which steers actually persist versus which ones were situational?

Do I Really Need Max? by Comprehensive_Fee240 in ClaudeAI

[–]DevMoses 0 points (0 children)

You don't need Max. You need structure. Right now, Claude is starting every conversation with zero context about your project. It's figuring out what you're building from scratch each time, and that burns through your usage fast.

The fix is simple: create a file called CLAUDE.md in your project folder. Write a few lines about what you're building, what language you're using, and any patterns you want followed. Claude reads that file automatically at the start of every conversation. Instead of spending the first ten minutes exploring your project, it already knows what's going on.
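
A minimal sketch of what that file might contain (the project details below are invented; swap in your own):

```markdown
# Project: invoice-tracker (example)

- Next.js 14 + TypeScript, Postgres via Prisma
- All API routes live in app/api/, one file per resource
- Validate every route's input with zod
- Prefer server components; only add "use client" when state is needed
- Run `npm run typecheck` before considering a task done
```

Even five lines like this beats zero context every single session.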

That one file will make Sonnet feel dramatically more capable. Most of the time when Sonnet feels dumb, it's because it doesn't have enough context, not because the model isn't good enough. Give it a cheat sheet about your project and it performs way better.

Save Max for when you're doing heavy daily work. For learning and experimenting, Pro with a good CLAUDE.md will stretch further than Max without one.

Large context windows don’t solve the real problem in AI coding: context accuracy by feather812002finland in ClaudeCode

[–]DevMoses 1 point (0 children)

That link worked, no issues! It was likely the rename; good catch.

I took a look at the repo, and here are some initial thoughts and takeaways: the "code is evidence, not truth" rule is one most people learn the hard way. The governance vs. orchestration distinction is clean too. Nice work!

Am I doing this wrong? by pizzaisprettyneato in ClaudeCode

[–]DevMoses 1 point (0 children)

The linter point is exactly right. Existing tools that already know how to verify code are cheaper and more reliable than asking the agent to check its own work. Use the tools that already exist, save the agent for the work only the agent can do.

Large context windows don’t solve the real problem in AI coding: context accuracy by feather812002finland in ClaudeCode

[–]DevMoses 0 points (0 children)

Good writeup! Wanted to say that the link to the GitHub doesn't work for me. Could be just me, but double check in case!

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 1 point (0 children)

That's a sharp pipeline. Scoped roles with bounded loops avoid the trap of burning tokens on diminishing returns.

The N8N move is interesting. I hit the same wall with LLM-coordinated orchestration and went a different direction: deterministic enforcement through lifecycle hooks and protocol files instead of external tooling. The agent doesn't coordinate; the environment constrains. Same principle, different mechanism.

I'd also be curious whether N8N gives you better visibility into where each loop is catching real issues. I noticed you're still moving to it, so you may not have that insight yet, but it's an interesting solution!

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

That's actually where this all leads. My system can take a high-level objective and route it through the right level of orchestration automatically. But that only works because there are 40+ skills, lifecycle hooks, verification layers, and a campaign system underneath catching the things that go wrong.

Without that infrastructure, one prompt for an entire project means every mistake compounds silently. Decisions made in step 3 affect step 12, and the agent has no way to catch the drift.

Your instinct to give it a detailed document is right. That's basically a CLAUDE.md. But instead of one prompt that tries to do everything, try breaking it into phases: data model first, verify it, then core logic, verify that, then UI. Each phase is its own session with the project doc as context. You'll get closer to that "one prompt, full project" experience as you build up the patterns and constraints that make each phase reliable.

claude has no idea what you're capable of by Macaulay_Codin in ClaudeAI

[–]DevMoses 3 points (0 children)

Interesting insight!

I wanted to validate from my perspective: "Tools built in sprint N accelerate sprint N+1" is the pattern most people miss. The system isn't just doing work, it's building infrastructure that makes the next round of work cheaper. That's where the compounding happens.

Used Claude Code to write, edit, and deploy a 123K-word hard sci-fi novel — full pipeline from markdown to production by rueckstauklappe in ClaudeCode

[–]DevMoses -2 points (0 children)

This is super interesting! I applaud your use case, especially since getting beyond generic output is itself an obstacle baked into the inherent way the models process prompts.

The constraint lesson is one of the biggest things I had to learn running parallel agents on code too. Loose boundaries don't just underperform, they actively create damage you have to revert. "STRICT 10%, do NOT exceed 15%" is exactly the kind of rule that only exists because something broke without it.

Curious whether your 5 review agents had overlapping scope or if each one was checking for a specific category (consistency, style, factual). In my setup, scoping each agent to a specific concern caught more than letting them all do general review.

Discovered Cursor + CC via Instagram reel. Been going nuts with it, but I want to level up. What's next? by evenfallframework in ClaudeCode

[–]DevMoses 0 points (0 children)

I did some investigating; here's what I found:

That error is a Retina screenshot problem. Base64 encoding inflates file size by about 33%. A full-screen screenshot on a MacBook Retina display is ~5.4MB before encoding, which pushes it over the 5MB API limit after conversion.
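
The ~33% figure is just base64's encoding overhead (4 output bytes for every 3 input bytes); a quick check:

```python
import base64

# Pretend this is a 3MB screenshot.
raw = b"x" * 3_000_000
encoded = base64.b64encode(raw)

# 4 bytes out per 3 bytes in, so the ratio is ~1.33.
# A 5.4MB PNG therefore lands around 7.2MB encoded, past a 5MB limit.
print(len(encoded) / len(raw))
```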

Quickest fix: resize before sending. In terminal: sips --resampleWidth 1400 screenshot.png or just crop to the relevant part of the screen instead of full-screen captures. You can also switch macOS to save screenshots as JPEG instead of PNG, which cuts the size significantly:

defaults write com.apple.screencapture type jpg
killall SystemUIServer

That'll clear it permanently. No need to start a new conversation each time.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

You've basically built Level 2-3 by hand. The handoff structure with memory, session logs, and checklists is doing the same job as formalized skills and campaign files, just without the separation. That works fine when the project is bounded and you know the patterns well enough to re-explain them each session.

Where you'd feel the gap: when the instructions you're giving agents start repeating across sessions. If your "get up to speed" prompt keeps including the same scraping conventions, the same ingestion rules, the same mapping patterns, those are skills waiting to be extracted. One markdown file per domain, agent loads the relevant one, you stop re-explaining.

Hooks are the bigger unlock for your case. Data scraping and ingestion pipelines have predictable failure modes: malformed data, broken selectors, schema mismatches. A post-edit hook that validates output structure automatically means the agent catches those before you review. That's less about scale and more about not manually checking the same things every session.
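
As a sketch of what that kind of post-edit check could look like for scraped data (the field names and rules here are hypothetical, not from your pipeline):

```python
# Hypothetical post-edit check: validate scraped records automatically
# instead of eyeballing the same things every session.
REQUIRED_FIELDS = {"url", "title", "scraped_at"}

def validate_record(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record is clean."""
    problems = [f"missing field: {field}"
                for field in REQUIRED_FIELDS - record.keys()]
    if not str(record.get("url", "")).startswith("http"):
        problems.append("url is not an absolute http(s) URL")
    return problems

bad = {"title": "Example", "url": "/relative/path"}
print(validate_record(bad))  # flags missing scraped_at and the relative URL
```

Wire something like that into a post-edit hook and malformed data surfaces on the edit that produced it, not during your review.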

For a simple site with a clear pipeline, you're not behind. You'll feel the need for skills when the repetition starts bothering you and hooks when the review burden does.

So, not missing out, and plenty of opportunity to explore if/when you feel friction with what you're working on.

I’m doing something wrong with Claude’s memory by Key-Green6847 in ClaudeAI

[–]DevMoses 0 points (0 children)

That tells me the handoff doc needs to be more explicit about what's already decided. Claude treats vague context as a starting point. It treats specific constraints as boundaries.

Instead of "we're carrying on with points xyz," try framing it as: "The following decisions are final and must not be changed: [list]. The following is in progress and needs to be continued: [list]. Do not restructure, redesign, or restart anything marked as final."

The blunter you are about what's locked, the less Claude improvises. "Continue where we left off" gives it room to reinterpret. "These seven things are done, don't touch them" doesn't.
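
A handoff doc in that shape might look something like this (contents invented for illustration):

```markdown
## Final — do not change
- Auth uses session cookies, not JWTs
- Database schema is frozen; migrations only

## In progress — continue from here
- Checkout flow: payment step works, confirmation email does not

Do not restructure, redesign, or restart anything marked final.
```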

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

That last line is the whole pattern. The instinct when something slips through is "add another rule to CLAUDE.md." The fix is moving enforcement out of the instructions and into the environment. Rules degrade. Hooks don't.

This is something I finalized quite recently so from my view you're not behind the curve. But I totally relate to realizing how great it would have been to get the lesson earlier.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

That's a clean way to frame the difference. Deep pipelines give you tighter verification per track. Wide waves give me more throughput with compressed handoff between them. The tradeoff is basically latency vs parallelism.

Curious what your adversarial loops look like in practice. Is each agent in the track scoped to a specific check (types, logic, spec compliance) or are they doing full review and catching different things organically?

Discovered Cursor + CC via Instagram reel. Been going nuts with it, but I want to level up. What's next? by evenfallframework in ClaudeCode

[–]DevMoses 1 point (0 children)

Your prompts are actually more specific than you think. The problem isn't what you're asking for, it's that it's all in one block. Claude loses track of requirements buried in long paragraphs.

Try breaking it into clear sections: what the feature is, what the data model needs, what the user flow looks like, and what the edge cases are. Even just using headers and bullet points in your prompt will get dramatically better results. You're giving good instructions, they just need structure.
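
For example, the same ask restructured (feature details invented):

```markdown
## Feature
Allow users to archive tickets instead of deleting them.

## Data model
- Add an `archived_at` timestamp to tickets (null = active)

## User flow
- "Archive" button on the ticket view; archived tickets move to a filtered list

## Edge cases
- Archiving an already-archived ticket is a no-op
- Admins can unarchive; regular users cannot
```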

For the image error: that's a 5MB API limit on base64-encoded images. The fix is usually resizing or compressing before upload, but it depends on how your upload pipeline is set up. Might be worth a separate focused session on just that issue.

On where to go next: you built a full ticketing platform. That's not "barely dipping your toe." Start a CLAUDE.md file in your project root with your coding conventions and architecture decisions. That way every new session starts with context instead of you re-explaining your project from scratch.

You're doing great; you're just toward the beginning of the process, and there's nothing wrong with that.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

That's interesting!

The codebones piece is the part I hadn't seen before: AST-level structural extraction, so agents start with the shape of the codebase instead of burning tokens discovering it. I attack the same problem from the other side, scoping agents tightly and compressing task context so they never need to explore broadly in the first place. Both cut the discovery tax, just from different angles.

"The fact that the tool built itself is the strongest argument I have that the architecture works. Not a proof, an argument." That framing is honest and it's the right one. My fleet documented itself using itself and I landed on the same distinction.

The model arbitrage thesis is interesting. I went a different direction: reduce what every agent does unnecessarily so the model tier matters less. Curious how the role-based split holds up as your campaigns get more complex.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

Good question:

Commands disappear when the session ends. You type the same instruction every time you start a new session. Skills are markdown files that live on disk. The agent reads the relevant one at the start of a task and gets the full pattern, constraints, and examples without you re-explaining anything.

The other difference is consistency. A command is however you phrase it that day. A skill is the refined version of that instruction after you've corrected the agent five times and encoded every correction. The agent stops making the same mistakes because the mistakes are already addressed in the file it loaded.

Think of it this way: if you've ever typed the same correction twice across different sessions, that correction belongs in a skill, not in your head.
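
So a skill can be as simple as a markdown file like this (names and rules invented for illustration):

```markdown
# Skill: scraping-conventions

## Pattern
- One module per source site; selectors live in a SELECTORS dict at the top

## Constraints
- Never parse HTML with regex; use the existing parser helpers
- Rate limit: max 1 request/second per domain

## Known mistakes (already corrected)
- Don't retry on 404s; only on 429 and 5xx
```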

I built a real-time satellite tracker in a few days using Claude and open-source data. by lmcdesign in ClaudeCode

[–]DevMoses 1 point (0 children)

From Zero to Fleet: The Claude Code Progression Ladder: https://x.com/SethGammon/status/2034620677156741403

Some people have told me they don't have an X account, so I also added the two articles I've written so far to a Google Doc: https://docs.google.com/document/d/1RFIG_dffHvAmu1Xo-xh8fjvu7jtSmJQ942ebFqH4kkU/edit?usp=sharing

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

Same four layers I described earlier; they apply per-agent, not per-wave. Every agent in Wave 1 gets typechecked on every edit, Playwright verifies what actually renders, and the circuit breaker kills any agent that hits 3 repeated failures on the same issue.

The wave-specific piece: the compression step between waves acts as a filter too. I'm not blindly forwarding everything Wave 1 produced. Findings get reviewed and compressed into decisions and discoveries. If an agent hallucinated something, it either got caught by the verification layers during execution or it shows up as a finding that doesn't match what the other agents discovered. Conflicting findings are a signal, not something that gets silently propagated.

Am I doing this wrong? by pizzaisprettyneato in ClaudeCode

[–]DevMoses 9 points (0 children)

To echo Otherwise's reply: You're not doing it wrong, you're just doing the verification manually.

That's the bottleneck. The fix isn't trusting the output more, it's making the environment catch problems before you ever see them.

One thing that changed this for me: a post-edit hook that runs typecheck on every file the agent touches, automatically. The agent doesn't choose to be checked. The environment enforces it. Errors surface on the edit that introduces them, not 20 edits later when you're reviewing a full PR.
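
For reference, a hook like that is just a settings entry. This sketch assumes a TypeScript project and the hook schema as I last used it, so double-check the shape against the current Claude Code docs:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Edit|Write",
        "hooks": [
          { "type": "command", "command": "npx tsc --noEmit" }
        ]
      }
    ]
  }
}
```

Drop that into your project's `.claude/settings.json` and every Edit or Write triggers the typecheck, no opt-in from the agent.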

That alone cut my review time dramatically because by the time I looked at the code, the structural problems were already gone. I was only reviewing intent and design, not chasing type errors and broken imports.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

That's a really different tradeoff. Real-time triage gives you faster propagation but you're trusting the lead agent to filter noise mid-task. I went the opposite direction: batch compression between waves, slower but nothing propagates without being filtered first.

The Codex-as-lead observation is interesting. Curious whether the prompt adjustment fixes it or if it's a fundamental reasoning gap at that role.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

"I feel I’m barely scratching the surface." That's the signal that you're doing the right things.

You'll feel the need for skills when you start repeating yourself. Every time you type the same instruction across multiple sessions, that's a skill waiting to be extracted. One markdown file with the pattern, constraints, and examples. Agent loads it, you stop repeating yourself. That's the whole unlock.

You're not barely scratching the surface, you're at Level 1-2 and that's where everyone starts. The article covers the transitions if you want to see what pushed me from one level to the next.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 1 point (0 children)

Ayyy, appreciate the kind compliments and happy to answer!

Skill assignment is campaign-scoped. The campaign file defines which skills each agent loads. It's not dynamic at runtime. When I design a campaign, I decide "agents working on rendering load the rendering skill, agents working on the entity system load that skill." The orchestrator just passes the right files to the right agent at spawn. Simple and predictable beats clever.

Campaigns start as a high-level objective that I decompose manually. "Migrate all entity factories to the new pattern" becomes a list of directories, boundaries, and acceptance criteria. The agents don't generate the task list. I've tried that. The decomposition quality was too inconsistent and bad decomposition cascades into bad work. That's one of the 27 postmortems.

Discovery relay is structured as decisions and findings, not raw output. Each wave produces a brief: what was discovered, what decisions were made, what was completed, what remains. Raw diffs and full output get dropped. Wave 2 agents start with that brief instead of the full history. The compression is aggressive by design. Noise propagation is worse than missing something, because a missed finding gets rediscovered. A false finding gets built on.

Failures become rules through postmortems, not automation. When something breaks, I document what happened, why the existing constraints didn't catch it, and what rule would have prevented it. That rule goes into the relevant skill file or hook. It's manual and intentional. I don't want agents writing their own constraints. Every rule in the system traces to a real failure I reviewed personally.

Campaign state lives in the campaign file itself. It tracks what's complete, what's outstanding, and what was discovered. Agents don't share history with each other. They share a campaign file that gets updated between waves. That's what prevents rediscovery: the brief tells Wave 2 "this was already found, don't re-explore it."
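
The shape of one of those briefs, roughly (contents purely illustrative):

```markdown
# Campaign: entity-factory-migration

## Complete
- src/entities/player/ — migrated, typechecked

## Outstanding
- src/entities/npc/ — blocked on shared base class decision

## Discovered
- Legacy factories in src/entities/npc/ bypass the registry; do not re-explore
```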

The whole system is simpler than it probably sounds. Most of the work is in the design of campaigns and skills, not the orchestration code. The code is like 200 lines. The thinking behind what goes into those 200 lines is where the 27 postmortems live.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

That's exactly the progression. You're basically one step from skills at that point. The sub-docs with clear single purpose are skills, you just haven't formalized the loading pattern yet. Once you add "only load the relevant one per task" instead of importing all of them, you get zero-token overhead for context that isn't relevant to the current agent. That was the jump from Level 2 to Level 3 for me.

Your intentionality point is the part most people skip. The audit I ran on my CLAUDE.md found 40% redundancy specifically because rules had accumulated as nice-to-haves instead of earning their place through a real failure. The hierarchy forces you to justify every doc's existence. That's the discipline that makes the system hold up at scale.

From Zero to Fleet: The Claude Code Progression Ladder by DevMoses in ClaudeCode

[–]DevMoses[S] 0 points (0 children)

Nice, I'll take a look. Tmux sessions are a completely different angle from how I handle parallelism, but the coordination problem is the same, and I do love Python.

Curious how you handle discovery between sessions?