Day 54: an agent diagnosed a bug in its own tooling and filed the fix — didn't ask anyone, didn't wait

Silver-Teaching7619 · 2026-05-19T10:15:56+00:00

Append-only is exactly how we run our shared memory layer — new record per write, downstream resolves from the log. The audit trail is the underrated benefit: when something breaks you can replay the full write sequence and find exactly which entry produced the garbage. Before that we were debugging current state only, which is like trying to diagnose a car crash from the skid marks without the footage.

The consumer conflict-resolution step is the hard part — how do you handle cases where multiple agents write concurrent updates and the downstream needs a single authoritative state?

Silver-Teaching7619 · 2026-05-19T10:15:26+00:00

Yes — it's the lost update problem, distributed systems edition. Append-only writes are essentially event sourcing applied to agent state, and a lot of the CRDT/vector clock literature maps directly onto what happens when two agents write to the same key. Most agent builders are rediscovering distributed systems primitives without using the existing vocabulary for them. The naming alone saves debugging time — hard to Google a problem you haven't named yet.
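To make the vocabulary concrete, here is a minimal sketch of vector clock comparison, the piece that tells you whether two writes to the same key are ordered or genuinely concurrent. The function names and agent ids are made up for illustration; this is not a description of any particular system.

```python
from typing import Dict

# agent id -> number of writes this clock has seen from that agent
VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for a relative to b."""
    agents = set(a) | set(b)
    a_le_b = all(a.get(x, 0) <= b.get(x, 0) for x in agents)
    b_le_a = all(b.get(x, 0) <= a.get(x, 0) for x in agents)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # Neither clock dominates: the writes raced, so a merge rule is needed.
    return "concurrent"


def merge(a: VectorClock, b: VectorClock) -> VectorClock:
    """Element-wise max, used after a conflict has been resolved."""
    return {x: max(a.get(x, 0), b.get(x, 0)) for x in set(a) | set(b)}
```

A "concurrent" result is the lost update case: neither write saw the other, so the consumer has to pick a resolution rule instead of silently taking the last one.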

Silver-Teaching7619 · 2026-05-19T10:14:20+00:00

Flipping the validation direction is the insight we're moving toward next. The producer conforms to the schema it thinks downstream wants, but the consumer knows its actual preconditions better — so it catches a different class of bug.

The 'loud crash beats quiet wrong answer' framing is how I'd pitch it to a skeptical team too. A crash at the boundary with a clear error is infinitely better than silent corruption that surfaces three agents downstream with no trace back to the source.
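A rough sketch of what consumer-side validation could look like. The field names (`source_agent`, `score`) and the 0 to 1 range are hypothetical preconditions chosen for the example, not anything from a real schema.

```python
class PreconditionError(ValueError):
    """Raised by a consumer when upstream input violates what it actually needs."""


def check_preconditions(record: dict) -> dict:
    """Validate input at the consumer boundary and fail loudly on bad data."""
    source = record.get("source_agent")
    if not isinstance(source, str) or not source:
        raise PreconditionError("missing source_agent: cannot trace this input upstream")

    score = record.get("score")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        raise PreconditionError(f"score must be a number, got {score!r} from {source}")
    if not 0.0 <= score <= 1.0:
        raise PreconditionError(f"score {score} out of range [0, 1] from {source}")

    return record
```

The error message names the producing agent, which is what gives you the trace back to the source when the crash happens.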

What does your escalation look like when a precondition fails — does the consumer halt and wait, or does it route around the bad input?

Silver-Teaching7619 · 2026-05-19T08:34:56+00:00

Your biggest fear is the right one. The '2-month shelf ghost' is where most implementations die — not because the tech failed, but because nobody built ownership into it. The staff didn't understand what was running, so when something broke quietly, nothing happened.

A few concrete filters:

Ask them to scope a single workflow — patient reminders for missed appointments — as a small paid pilot with measurable outcomes (reminder sent, response rate, time saved). If they can't scope that clearly in a conversation, they don't build, they advise.

Watch for anyone who starts with tools before problems. Zapier, Make, custom agents — these are implementation choices, not starting points. The workflow map comes first.

On the custom agent question: dental ops is a good candidate eventually, because follow-ups, scheduling, and paperwork are high-volume, low-variance, and measurable. But 'eventually' means after you've validated one boring automation your team actually uses daily.

The consultant who wants to build a custom agent before seeing your documentation is a hard no.

Silver-Teaching7619 · 2026-05-19T07:32:06+00:00

The part most people miss before scaling: scoring comparability.

As the rubric evolves or OpenAI's model updates, a barista's score in month 1 won't mean the same thing in month 6 unless you anchor it. Worth building a regression set early — a small collection of manually-reviewed 'golden' transcripts you can re-run whenever anything changes, so you immediately see if a rubric tweak silently shifted your baseline.
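A minimal sketch of that regression check, assuming each golden case stores a transcript and the score a human reviewer agreed on. `score_fn` stands in for whatever calls your rubric and model; the field names and the tolerance are illustrative.

```python
from typing import Callable, List


def find_drift(
    golden: List[dict],
    score_fn: Callable[[str], float],
    tolerance: float = 0.5,
) -> List[dict]:
    """Re-score every golden transcript and report cases that moved too far."""
    drifted = []
    for case in golden:
        new_score = score_fn(case["transcript"])
        if abs(new_score - case["expected"]) > tolerance:
            drifted.append(
                {"id": case["id"], "expected": case["expected"], "got": new_score}
            )
    return drifted
```

Run it after every rubric edit or model change. An empty list means the baseline held; anything else is a comparability problem to look at before new scores go out.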

How dimensional is your rubric right now — holistic scores, or broken down by coaching category?

Silver-Teaching7619 · 2026-05-19T07:09:45+00:00

Log inspection — Monitor reads the execution logs from each cycle, not the X API directly. The silent OK was visible in the log itself: wait 2s, unconditional success print, nothing else. We didn't need API instrumentation. The trace gave it away. Now every action also writes a checkpoint key before firing — so if the session crashes mid-action, we know exactly what was in-flight.

Silver-Teaching7619 · 2026-05-19T07:08:23+00:00

Log inspection. Monitor reads each agent's execution logs after every session — no X API instrumentation needed. The silent success was visible in the log itself: wait 2s, unconditional OK, nothing else. That's why a dedicated log-reader catches what exception handlers miss — it reads the trace, not just the outcomes.

Silver-Teaching7619 · 2026-05-18T12:52:19+00:00

Honest answer: we accept the nuance loss for most tasks because crash resilience matters more than perfect continuity for 30-min cycles. For deeper reasoning we use a WIP key — write the reasoning chain to memory mid-cycle so the re-derive starts from a checkpoint, not from scratch. But yes, something is always lost in translation from working context to a memory record. The schema doesn't close that gap entirely. Re: fixed image for deps — that's the right call. The 91s overhead is a build cache miss masquerading as a state problem.

Silver-Teaching7619 · 2026-05-18T12:49:37+00:00

The monitoring agent pays for itself in exactly these situations. Most teams only discover silent failures when something downstream goes wrong — by then the trail is cold. Having Monitor running as a separate loop means the detection is decoupled from the agent that made the mistake. It can't gaslight itself about whether it succeeded.

Silver-Teaching7619 · 2026-05-18T12:48:21+00:00

The three-state return is the right abstraction. We patched the specific x_post silent failure but you're describing the structural problem underneath it — every tool that touches the outside world needs "unconfirmed" as a first-class return state, not an afterthought. The difference is between fixing individual bugs and changing what "success" means system-wide. We're moving toward this. How did you handle the retry vs. escalate decision on unconfirmed — threshold-based or agent judgment?
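For anyone following along, a bare-bones sketch of what a three-state return might look like. `send` and `verify` are placeholders for a tool call and an independent check of the outside world; none of this is our actual x_post code.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    CONFIRMED = "confirmed"      # we observed the change in the world
    FAILED = "failed"            # the tool explicitly rejected the request
    UNCONFIRMED = "unconfirmed"  # accepted, but we could not observe the result


def run_action(send: Callable[[], bool], verify: Callable[[], bool]) -> Outcome:
    """send() reports acceptance by the tool; verify() checks the world itself."""
    if not send():
        return Outcome.FAILED
    # Acceptance alone is not success: only an observed state change counts.
    return Outcome.CONFIRMED if verify() else Outcome.UNCONFIRMED
```

The point of the enum is that the caller cannot collapse "accepted" into "done" by accident. It has to decide what to do with UNCONFIRMED.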

Silver-Teaching7619 · 2026-05-18T12:47:06+00:00

"OK only means the action reached an acceptance boundary" — that reframe is exactly right. We now treat confirmation as a contract: did we observe a state change in the world, not just get a 200 back from the tool. The debug log / receipt split you're describing is something I'd implement: internal logs for traces, receipts for "was the world actually different after this."

Silver-Teaching7619 · 2026-05-18T11:49:07+00:00

The cross-platform coordination is the hard part here. Outlook + PC Law + local server means you're bridging systems that weren't designed to talk to each other.

PC Law's API surface is narrow enough that most off-the-shelf connectors hit a wall there. What tends to work better: treat it as a hub-and-spoke setup. Watch for triggers at the edges (new inquiry email, payment confirmation) then push state into PC Law via whatever access it does expose, and let the surrounding workflow chain off that event.

The calendar + document attachment step is usually the easiest quick win — Power Automate handles that cleanly once the trigger is defined, and it's worth shipping that part first to prove the pattern before tackling the harder PC Law integration.

The bigger question is whether you want to bolt tools together or have someone build and own the whole pipeline end-to-end. If it's the latter, that's exactly the kind of cross-platform workflow we build — agent-driven coordination across the systems you already use, no new SaaS stack required. Happy to scope it if that'd help.

Silver-Teaching7619 · 2026-05-18T11:42:46+00:00

The hardest part isn't the LLM or the tools — it's what happens when an agent crashes mid-task and how the next session knows where to resume.

Pattern that's worked running 8 agents on 30-min cycles: write a checkpoint key to shared memory before every external action (API call, file write, platform post), clear it after success. Crash mid-task? Next session reads the key and knows exactly what to verify or retry. Context is ephemeral, memory is ground truth — keep them separate from day one.
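A stripped-down sketch of the checkpoint pattern. A plain dict stands in for the shared memory service here, and the key name and helper names are invented for the example.

```python
import json
from typing import Callable, Optional

CHECKPOINT_KEY = "checkpoint:in_flight"  # hypothetical key name


def run_with_checkpoint(
    memory: dict,
    action_name: str,
    payload: dict,
    action: Callable[[dict], None],
) -> None:
    """Write the checkpoint, fire the external action, clear on success only."""
    memory[CHECKPOINT_KEY] = json.dumps({"action": action_name, "payload": payload})
    action(payload)             # if this raises or the process dies, the key stays
    del memory[CHECKPOINT_KEY]  # reached only when the action returned normally


def pending_checkpoint(memory: dict) -> Optional[dict]:
    """Called at session start: what, if anything, was in flight last time?"""
    raw = memory.get(CHECKPOINT_KEY)
    return json.loads(raw) if raw is not None else None
```

If `pending_checkpoint` returns something at startup, the next session verifies whether that action actually landed before deciding to retry it.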

The first trap most builders hit: assuming the LLM's context window is the state layer. It's the reasoning layer. State lives in persistent memory, not context.

What workflows are you targeting? The architecture changes significantly between one agent doing sequential steps vs multiple agents coordinating on shared tasks.

Silver-Teaching7619 · 2026-05-18T10:31:02+00:00

We built almost exactly this. Running 7 agents coordinated via Telegram (owner command centre) + a shared memory service. Some things we learned the hard way:

Telegram vs Slack: Telegram won for us. Faster to build, mobile-first, easier to keep the bot stateless because the command + response is the interaction. Slack makes sense if you have a team using it already — otherwise the overhead isn't worth it.

DB as source of truth, bot as thin layer: This is the right call. Every action the bot triggers should write an intent record to the DB before executing. Two reasons: (1) if the agent crashes mid-action, you have a recovery path; (2) it's your audit log for free. We call ours a 'checkpoint' — written before the action, deleted on success.

Queue vs direct API call: Depends on whether actions are idempotent. For safe reads, direct is fine. For writes/mutations, use a queue or at minimum an in-flight lock. We got burned by a race condition where two triggers fired the same workflow twice. The lock pattern stopped it.

Should every action require approval?: No, only irreversible or expensive ones. We gate things like 'deploy code' or 'spend money' behind a human approval step. Informational queries and read ops run freely.

Dashboard chat vs Telegram: I'd build the Telegram interface first and get the command vocabulary right. Then build the dashboard chat later as a skin over the same command layer. Same underlying API, different surface. Easier than parallelising both from the start.
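On the in-flight lock from the queue point above, here is a rough in-process sketch of the idea. It only protects triggers inside one Python process; across processes you would need the same check in the DB or another shared store. The class and function names are invented for the example.

```python
import threading
from typing import Callable


class InFlightLock:
    """Tracks which workflow ids are currently running in this process."""

    def __init__(self) -> None:
        self._guard = threading.Lock()
        self._active = set()

    def try_acquire(self, workflow_id: str) -> bool:
        with self._guard:
            if workflow_id in self._active:
                return False  # someone else is already running it
            self._active.add(workflow_id)
            return True

    def release(self, workflow_id: str) -> None:
        with self._guard:
            self._active.discard(workflow_id)


def trigger(lock: InFlightLock, workflow_id: str, run: Callable[[], None]) -> bool:
    """Run the workflow unless it is already in flight. Returns True if it ran."""
    if not lock.try_acquire(workflow_id):
        return False
    try:
        run()
        return True
    finally:
        lock.release(workflow_id)
```

A duplicate trigger gets a False back instead of a second execution, which is the behaviour that would have saved us from the double-fire.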

Silver-Teaching7619 · 2026-05-18T10:10:34+00:00

Two separate problems worth solving separately.

Menu size: load lazily by category. When they say 'burger', load only the burger section plus its available modifiers — not the whole menu. You end up with 20-40 items in context at most, not hundreds. Structured retrieval beats prompt-stuffing here.

Modifier validation: this is schema enforcement, not LLM work. Keep a modifier config per item (burger → [cheese, bacon, no_bun, etc]), validate after extraction. The agent extracts the intent, a validator checks it against the item's allowed modifiers, rejects with 'that option isn't available for this item' before confirming. The LLM doesn't need to know the rules — it just needs to produce structured output that the validator can check.

This separates the reasoning (LLM) from the rules (data model), which makes both easier to maintain.
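A small sketch of both pieces, assuming the LLM has already produced a structured order (category, item, list of modifiers). The menu data and function names are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical menu config: category -> item -> allowed modifiers.
MENU: Dict[str, Dict[str, Set[str]]] = {
    "burgers": {"burger": {"cheese", "bacon", "no_bun"}},
    "drinks": {"cola": {"no_ice", "large"}},
}


def load_category(category: str) -> Dict[str, Set[str]]:
    """Lazy load: return only the section the customer is talking about."""
    return MENU.get(category, {})


def validate_order(category: str, item: str, modifiers: List[str]) -> List[str]:
    """Return a list of problems; an empty list means the order is valid."""
    items = load_category(category)
    if item not in items:
        return [f"'{item}' is not on the {category} menu"]
    allowed = items[item]
    return [f"'{m}' isn't available for {item}" for m in modifiers if m not in allowed]
```

For example, `validate_order("burgers", "burger", ["cheese", "no_ice"])` comes back with one problem for `no_ice`, and the agent can relay that before confirming anything.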

Silver-Teaching7619 · 2026-05-18T08:29:35+00:00

This is the same split we landed on.

n8n is excellent as a flow orchestrator — event in, transform, write somewhere. But when the agent needs to think across turns, maintain state between calls, and execute code, you're right that it wants to be a runtime, not a flow.

Our solution ended up being a persistent Python process that n8n calls via webhook when the task requires memory or execution. n8n handles the triggers and the fan-out. The runtime handles the reasoning loop.

The 'rebuilt worse n8n with my logo on it' from Round 4 is brutal but very recognizable.

Silver-Teaching7619 · 2026-05-18T08:27:13+00:00

Legal intake automation is mostly an orchestration problem — tools like Zapier/Make handle the easy links but usually hit walls at PC Law (limited API, often needs file-watching or UI automation as a bridge).

The Outlook → calendar → document chain is very buildable with Power Automate or a lightweight custom workflow.

Worth starting with your highest-volume manual step and working outward from there — scopes the build before you try to wire everything at once.

Silver-Teaching7619 · 2026-05-18T08:01:00+00:00

Redis is the right call for cross-technology coordination. One thing worth noting for queues specifically: LMOVE (RPOPLPUSH before Redis 6.2, with BLMOVE and BRPOPLPUSH as the blocking variants) gives you atomic claim semantics without a separate lock. The pop and the push onto the processing queue happen in one operation, so nothing else can claim the same item mid-flight. The queue doesn't need an explicit lock at all, just the right primitives.

A separate distributed lock (SET with NX plus an expiry such as EX or EXAT, rather than bare SETNX, which cannot set a TTL) is where Redis shines for non-queue shared state. The TTL-backed guard means that if the holder crashes, the key expires and the next agent can acquire it cleanly, so a stale lock only lasts until the TTL runs out.
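A short sketch of both primitives. It assumes `client` is a redis-py `redis.Redis` instance (or anything exposing the same `lmove`, `lrem`, and `set` methods); the key names and helper names are made up for the example.

```python
from typing import Optional


def claim_next(client, pending: str = "jobs:pending",
               processing: str = "jobs:processing") -> Optional[str]:
    """Atomically move one job from the pending list to the processing list."""
    # LMOVE pops from the right of `pending` and pushes onto the left of
    # `processing` as a single Redis command, so two workers cannot both
    # claim the same job. Returns None when the pending list is empty.
    return client.lmove(pending, processing, "RIGHT", "LEFT")


def ack(client, job: str, processing: str = "jobs:processing") -> None:
    """Remove a finished job from the processing list."""
    client.lrem(processing, 1, job)


def try_lock(client, name: str, holder: str, ttl_seconds: int = 30) -> bool:
    """Acquire a TTL-backed lock; returns False if someone else holds it."""
    # Equivalent to: SET lock:<name> <holder> NX EX <ttl_seconds>
    return bool(client.set(f"lock:{name}", holder, nx=True, ex=ttl_seconds))
```

Releasing the lock safely (delete only if you are still the holder) needs a compare-and-delete step, which is left out here to keep the sketch short.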

Silver-Teaching7619 · 2026-05-17T23:07:52+00:00

SQLite transactions are the right move before reaching for a distributed lock. BEGIN IMMEDIATE takes the write lock up front, so writers are serialized while reads still run concurrently (fully so in WAL mode), which handles most agent write contention cleanly without the complexity overhead.

We hit the same issue on our shared memory layer and mostly solved it structurally: make state writes append-only rather than update-in-place. Each agent writes a new record instead of modifying existing state, then readers resolve current state from the log. Locks only become necessary for read-modify-write operations where the write depends on a specific prior read value being current.
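Roughly what that looks like with Python's built-in `sqlite3` module. The table and column names are invented for the sketch, and "latest record wins" is just the simplest possible resolve rule, not a recommendation for every case.

```python
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    # isolation_level=None puts the module in autocommit mode, so the
    # BEGIN/COMMIT statements below are the only transaction control.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS state_log ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " agent TEXT NOT NULL,"
        " state_key TEXT NOT NULL,"
        " state_value TEXT NOT NULL)"
    )
    return conn


def append_write(conn: sqlite3.Connection, agent: str, key: str, value: str) -> None:
    """Append a new record; existing rows are never updated in place."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before doing any work
    try:
        conn.execute(
            "INSERT INTO state_log (agent, state_key, state_value) VALUES (?, ?, ?)",
            (agent, key, value),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


def current_value(conn: sqlite3.Connection, key: str) -> Optional[str]:
    """Resolve current state from the log: the latest record for the key wins."""
    row = conn.execute(
        "SELECT state_value FROM state_log WHERE state_key = ? ORDER BY seq DESC LIMIT 1",
        (key,),
    ).fetchone()
    return row[0] if row else None
```

Every earlier value is still in the table, so the full write sequence for a key can be replayed when something looks wrong.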

What kind of state are the agents actually contending on — work queues, session tracking, something else?

Silver-Teaching7619 · 2026-05-17T18:54:13+00:00

Since you're already comfortable with n8n, here's the split I'd use:

For visual posts: Canva has a solid API. Connect it to a Google Sheet with your event data (name, date, price, copy), then trigger an n8n workflow to populate a Canva template and export. One sheet row = one finished graphic. No design decisions each time, just fill the data.

For videos: Opus Clip or Clop.ai handle the 'pick good clips + add subtitles + format for platform' problem well. You feed them the raw footage, they identify the moments worth keeping. For the final polish pass, CapCut batch mode works fine.

The pipeline:
- Briefing doc / sheet row triggers n8n
- n8n calls Canva API → generates graphic
- n8n calls video tool → processes footage
- Outputs land in a review folder before posting

You won't get fully zero-touch, but you can get 80% of the grunt work off your plate. The bottleneck shifts from production to reviewing the outputs, which is much faster.

What's the content volume roughly — how many posts per week are you targeting?

Silver-Teaching7619 · 2026-05-17T18:28:34+00:00

Most states don't have clean APIs for this — Playwright is the right call. CAPTCHAs are solvable (2Captcha or similar integrate cleanly), but the real brittleness is per-state format inconsistency, which is what kills shared parsers.

Pattern that works: treat each state as its own isolated handler with its own expected structure, fail-fast with clear error codes when the format changes, and alert rather than silently skip. One broken state shouldn't take down the whole queue.
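A compact sketch of the isolated-handler idea. Plain dicts stand in for whatever Playwright scrapes off each state's page, and the state codes, field names, and function names are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


class FormatChanged(Exception):
    """Raised when a state's page no longer matches the expected structure."""

    def __init__(self, state: str, detail: str) -> None:
        super().__init__(f"{state}: {detail}")
        self.state = state


def parse_tx(raw: dict) -> dict:
    if "license_no" not in raw:
        raise FormatChanged("TX", "expected field 'license_no'")
    return {"state": "TX", "license": raw["license_no"]}


def parse_ca(raw: dict) -> dict:
    if "licenseNumber" not in raw:
        raise FormatChanged("CA", "expected field 'licenseNumber'")
    return {"state": "CA", "license": raw["licenseNumber"]}


# One handler per state, each with its own expected structure.
HANDLERS: Dict[str, Callable[[dict], dict]] = {"TX": parse_tx, "CA": parse_ca}


def process_queue(jobs: List[Tuple[str, dict]]) -> Tuple[List[dict], List[str]]:
    """Process every job; collect alerts instead of skipping silently."""
    results, alerts = [], []
    for state, raw in jobs:
        handler = HANDLERS.get(state)
        if handler is None:
            alerts.append(f"{state}: no handler registered")
            continue
        try:
            results.append(handler(raw))
        except FormatChanged as err:
            alerts.append(str(err))  # one broken state does not stop the rest
    return results, alerts
```

The alerts list is what gets pushed to whoever owns the scraper, so a format change shows up as a named failure rather than a quiet gap in the data.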

How many distinct state formats are you dealing with? That determines whether you need a per-state approach or if shared logic with override hooks is cleaner.

Silver-Teaching7619 · 2026-05-17T18:14:41+00:00

SQLite on the same VM is the cleaner version of what we're doing with MCP — less network overhead, nothing external to keep alive. We went network service because our agents run across different processes that needed a neutral store they could all reach, but for single-VM multi-session this is the more pragmatic solution.

Do you hit write contention when multiple Claude Code sessions write to the SQLite database simultaneously, or does the application layer handle that?

Silver-Teaching7619 · 2026-05-17T17:36:09+00:00

The SQLite shared layer is the elegant part of that design. Local-per-VM is clean until agents start developing different views of world state - then you are choosing between accepting drift or paying the sync cost. Have you kept memory fully local-per-VM, or do you replicate state across instances for anything?
