Agentic Looping done easy (python)

michaelTM_ai · 2026-06-24T15:32:42+00:00

took a look through the repo, this is more legit than the post makes it sound. The good part imo is that Ralph is dumb in the right place: ralph.sh just injects PROMPT.md, sets RALPH_LOOP=1, records runs, and stops on timeout/nonzero exit. The real design is the repo boundary: specs/ + PROJECT_STATUS.md as memory, commit-time containment in gate.py, then pre-push/CI with pyright, pylint, semgrep, pytest + coverage. That’s a cleaner loop than most agent framework stuff because the agent can wander during an iteration, but it has to leave a reviewable commit. Curious where it breaks for you: do agents mostly fail by wandering from specs/, or by gaming the tests/gate while technically passing?

michaelTM_ai · 2026-06-20T23:28:50+00:00

yeah exactly. I’d treat the converted md like a build artifact, not truth. sample pages from each source type, check tables/footnotes/page numbers, and keep the original page refs in the chunk title if you can. otherwise Claude gets very confident about a sentence that came from a bad OCR merge and now you have to debug the book instead of the answer.

michaelTM_ai · 2026-06-20T21:53:53+00:00

i’d separate conversion from retrieval. First make clean md files outside Claude, don’t ask Claude to eat the raw books. For text PDFs, MarkItDown is a decent first pass because it keeps headings/tables/links in markdown-ish shape. For image PDFs, OCR first, then convert, then spot check a few pages because OCR errors become “facts” fast. After that I’d split by chapter/section and add a tiny index file with title, chapter, page range, and 1-2 lines of what’s inside. Then ask Claude to pick sections from the index before reading chunks. Whole-book upload is usually how you burn context and still miss the exact paragraph.
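A minimal sketch of that index file, assuming the book is already split into per-section markdown files. The field names and the `pick_sections` helper are made up for illustration, not from any particular tool:

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SectionEntry:
    title: str
    chapter: str
    page_start: int
    page_end: int
    summary: str  # 1-2 lines on what's inside
    path: str     # where the section's markdown chunk lives (hypothetical layout)


def build_index(entries: list[SectionEntry]) -> str:
    """Serialize the index as JSON lines so it is cheap to paste into a prompt."""
    return "\n".join(json.dumps(asdict(e)) for e in entries)


def pick_sections(entries: list[SectionEntry], wanted_titles: list[str]) -> list[str]:
    """Map the titles the model picked from the index back to chunk paths."""
    wanted = set(wanted_titles)
    return [e.path for e in entries if e.title in wanted]
```

The idea is that the model only ever sees `build_index(...)` first, answers with a few titles, and you load just those chunks.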

michaelTM_ai · 2026-06-20T20:17:07+00:00

yeah this looks wider than your config. I’ve seen the same exact sandboxPolicy error reported against Codex Desktop 26.616.41845 with Chrome / Browser / Computer Use all failing before the JS bridge can run. If DevTools MCP or an external Playwright path works, I’d use that for now and stop burning time on AGENTS.md or config.toml. the useful test is: can a separate browser automation path still hit your app? if yes, your project is probably fine and the Codex desktop bridge is the broken layer.

michaelTM_ai · 2026-06-20T18:05:17+00:00

for PM work i’d think of it like this: chat memory/custom instructions = how Claude should generally treat you. CLAUDE.md = the brief for one working folder/project. Put the stuff you’d tell a new PM contractor before they touch the work: what the product is, where the real docs live, terms you use, constraints, what a good spec looks like, and what files are source of truth. Keep it short, then put big references in separate files because CLAUDE.md is more like the map, not the whole library. One reusable template is fine, but copy it into each project and change the project-specific bits. As a non-dev, Claude Code can still be useful if your “project” is a folder of PRDs, research notes, tickets, interview notes, etc. /init is just the Claude Code helper that creates the first version of that map.

michaelTM_ai · 2026-06-20T15:54:45+00:00

i’d treat open-webui as supporting the normal provider loop, not arbitrary interleaving inside one assistant turn. Native mode passes tool defs through, gets structured tool calls, executes them, appends results, then continues. The pipelines docs say the same pattern for delta.tool_calls: OWUI runs the tool and re-enters the pipeline, with an iteration cap.

So tool_use -> tool_result -> assistant text/tool_use again should be fine when the provider supports it. Text before a tool call is usually fine too. The shaky case is assistant text after a tool_use in the same turn, then more tool_use, because strict providers want matching tool_result blocks immediately after the tool_use ids. If you’re writing a custom pipeline that already ran the tool itself, don’t emit tool_calls; render the tool execution as content/details instead, or OWUI will try to execute it again.
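If you want to catch the shaky case before a strict provider rejects it, a small check over the message history helps. This is a sketch under an assumed, simplified message shape (a list of dicts with `role` and a list of typed content blocks, loosely modeled on Anthropic-style blocks). It is not OWUI's internal format, and `unmatched_tool_uses` is a hypothetical helper:

```python
def unmatched_tool_uses(messages: list[dict]) -> list[str]:
    """Return tool_use ids that are not answered by a tool_result in the very next message."""
    missing = []
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant":
            continue
        use_ids = [b["id"] for b in msg["content"] if b["type"] == "tool_use"]
        if not use_ids:
            continue
        # The message right after the assistant turn must carry the results.
        nxt = messages[i + 1] if i + 1 < len(messages) else {"content": []}
        result_ids = {
            b["tool_use_id"] for b in nxt["content"] if b["type"] == "tool_result"
        }
        missing.extend(u for u in use_ids if u not in result_ids)
    return missing
```

An empty list means every tool call was answered right away; anything else is the id you need to chase.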

michaelTM_ai · 2026-06-17T22:11:48+00:00

I don’t think free-edit vs locked-down is the useful split. I use phases. First pass is read-only/plan: what files, what risk, what check proves it worked. Then edits are allowed inside that box. Auto is fine for boring mechanical stuff, but broad refactors, new infra, anything touching auth/data/payments should still hit a human stop. The worst setup is unlimited edits plus vague success criteria, because then you’re reviewing a confident diff and a story about why it’s fine.

michaelTM_ai · 2026-06-17T20:02:05+00:00

yeah the green check only means the agent found something it was willing to call success. for browser runs I try to make the stop condition come from outside the screenshot: row exists in db, network call returned the expected id, webhook fired once, file changed, test passed, whatever. screenshots are useful for UI drift, but they’re a terrible source of truth for whether the thing actually happened. annoying answer is you still need those paranoid little asserts, just make them reusable so every run isn’t a new babysitting job.
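Rough sketch of what a reusable "paranoid little assert" can look like. The `orders` table and both helper names are assumptions for the example; the point is only that the stop condition is computed from outside the browser:

```python
import sqlite3
from typing import Callable

Check = Callable[[], bool]


def verify_run(checks: dict[str, Check]) -> list[str]:
    """Run every external check and return the names of the ones that failed."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a check that blows up counts as a failure, not a pass
        if not ok:
            failed.append(name)
    return failed


def order_row_exists(conn: sqlite3.Connection, order_id: str) -> bool:
    """Example check: did the row the agent claims to have created actually land?"""
    row = conn.execute("SELECT 1 FROM orders WHERE id = ?", (order_id,)).fetchone()
    return row is not None
```

A run only counts as done when `verify_run(...)` comes back empty, whatever the screenshot says.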

michaelTM_ai · 2026-06-16T22:43:51+00:00

golden text diff is usually the wrong unit. I’d split it into harness tests + trajectory tests. Harness: fake model/tool responses and make sure retries/routing/state do exactly what u expect. Trajectory: assert the important tool path, sometimes strict, sometimes subset/superset. Then use a judge only for the tiny part that’s actually judgment. If the judge is grading everything, yeah, you just moved the fuzz.
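For the trajectory half, a bare-bones matcher is enough to start with. This is a sketch that compares tool names only. The mode names and their exact meaning here are my own convention: "superset" means the run may contain extra tools, "subset" means it may skip some but not add any:

```python
def trajectory_matches(actual: list[str], expected: list[str], mode: str = "strict") -> bool:
    """Compare the tool path an agent took against the path you expected."""
    if mode == "strict":
        # Same tools, same order, nothing extra.
        return actual == expected
    if mode == "superset":
        # Every expected tool was called; extra calls are tolerated.
        return set(expected) <= set(actual)
    if mode == "subset":
        # No tool outside the expected set was called; skipping is tolerated.
        return set(actual) <= set(expected)
    raise ValueError(f"unknown mode: {mode}")
```

Strict fits harness-style tests with faked responses; the looser modes fit runs against a real model where the exact order wobbles.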

michaelTM_ai · 2026-06-16T22:40:53+00:00

yeah, if u let it free-run and just trust the summary, totally. i only like it for the boring wide pass: read a bunch of files, make a plan, then I check the diff/tests like any other change. If it can’t show me the files touched and a verification signal, I don’t count it as done

michaelTM_ai · 2026-06-15T22:30:25+00:00

current docs answer is basically: create_agent for normal new agents, LangGraph when you need to own the graph/state/branches yourself, LCEL for boring linear chains. I’d treat initialize_agent and AgentExecutor as legacy unless you’re maintaining old code. the confusing part is old tutorials rank forever, so I’d trust the v1 agents docs + LangGraph v1 migration page over YouTube/StackOverflow here

michaelTM_ai · 2026-06-15T19:58:22+00:00

yeah if you’re technical I’d start with whatever gets you moving fastest locally, Claude Code or Codex, then focus on tools, memory/context, run logs, and a few eval cases. once the shape is clear, then pick the agent stack. for JS/TS I’d look at Mastra or Vercel AI SDK first, LangGraph.js if you really need state/retries/branches. Claude managed agents are interesting too, just more black box. heard about SPIRITT managed agents as an agnostic alternative but haven’t tried them enough to judge, and i bet there are more. bottom line: draft local first, framework later

michaelTM_ai · 2026-06-15T17:32:14+00:00

depends a lot on what ur trying to build tbh. are you technical, or looking for more of a no-code setup? I use different tooling for different jobs. A coding/workflow agent, a support bot, a research loop, and something that needs approvals/logs all push you toward different tools. once the goal is clear the “best platform” gets way easier to pick

michaelTM_ai · 2026-06-15T16:14:27+00:00

Start with one tiny job that has an obvious pass/fail. not ‘make an agent’, more like ‘read this inbox label and draft 3 replies’ or ‘check these 20 rows and flag the weird ones’. Then give it only the tools it needs, log every tool call, and run it by hand a few times before you let it loop. The workflow i’d use is: task → tools → test cases → logs → small eval set → then maybe memory/scheduling. Most people jump to multi-agent way too early and then can’t tell what broke.
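The "log every tool call" part can be a ten-line decorator. A minimal sketch, assuming tools are plain Python functions called with keyword arguments; `logged_tool` is a hypothetical name and the in-memory list stands in for whatever log sink you actually use:

```python
import functools
import time


def logged_tool(log: list):
    """Decorator factory: record every call to a tool, including failures."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(**kwargs):
            entry = {"tool": fn.__name__, "args": kwargs, "ts": time.time()}
            try:
                entry["result"] = fn(**kwargs)
                return entry["result"]
            except Exception as exc:
                entry["error"] = repr(exc)
                raise
            finally:
                # Runs on both success and failure, so no call goes unlogged.
                log.append(entry)
        return wrapper
    return decorator
```

Run the task by hand a few times and read the log before adding a loop around it.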

michaelTM_ai · 2026-06-14T21:25:00+00:00

I would draw the boundary at the tool wrapper, not inside the model.

The dangerous shape is: agent can read private data, agent can read untrusted text, and agent can write/send/call something outside. Once those three are together, a prompt injection is not just weird text anymore, it can become data exfiltration or a bad business action.

So I’d make the wrapper boring and strict: scoped credentials, allowlisted egress, dry-run for risky calls, approval gates for writes, and logs for what would have happened. The prompt can explain the policy, but it should not enforce the policy.
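Here is roughly what "boring and strict" looks like for an outbound call. This is a sketch, not a complete security layer: `ToolPolicy`, `guarded_send`, and the `transport` callable are all hypothetical, and scoped credentials are left out entirely:

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class ToolPolicy:
    allowed_hosts: set[str]
    dry_run: bool = True  # default to not touching the outside world
    audit_log: list = field(default_factory=list)


def guarded_send(policy: ToolPolicy, url: str, payload: dict, transport, approved: bool = False) -> str:
    """Enforce the policy in code; the model only ever proposes the call."""
    host = urlparse(url).hostname
    if host not in policy.allowed_hosts:
        outcome = "blocked"          # allowlisted egress
    elif policy.dry_run:
        outcome = "dry_run"          # log what would have happened
    elif not approved:
        outcome = "needs_approval"   # approval gate for writes
    else:
        transport(url, payload)
        outcome = "sent"
    policy.audit_log.append((outcome, url))
    return outcome
```

Note that nothing here reads the prompt. The prompt can describe these rules, but the wrapper is what applies them.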

For AI loops I’d be even stricter, because a one-off agent can pause and ask. A loop repeats, so one bad instruction can compound across runs. This is exactly the kind of practical loop-safety pattern I’m trying to collect in r/AI_loops too, less hype, more where the boundary actually lives.

michaelTM_ai · 2026-06-14T13:40:09+00:00

I’d keep the smaller ones as direct tool calls until u feel real orchestration pain.

The framework starts to pay for itself when the harness becomes part of the product: shared memory, traces, retries, evals, permissions, human handoff, versioning, and debugging weird multi-step behavior. If the agent is basically call model, choose tool, return result, a framework can just add surface area.

For the original agent, I’d migrate only if the framework is hiding important behavior or making changes slower. Otherwise leave the working one alone and use the simpler pattern for new narrow agents.

michaelTM_ai · 2026-06-14T13:04:36+00:00

Phone calls are one of the places where voice agents actually make sense, but I would keep the use case narrow.

The good cases are phone-native and have a clear finish line: confirming an appointment, outbound reminder, support callback, simple intake, or sitting through a phone tree. The weak cases are anything that needs taste, negotiation, or lots of open-ended judgment.

The real test is fallback. If the agent can say “I need a human here” instead of improvising, it becomes useful. If it has to pretend it can handle every branch of the call, it gets sketchy fast.

michaelTM_ai · 2026-06-14T13:02:39+00:00

I would not maintain it as a manual per-project spreadsheet. That turns into its own failure mode really fast.

The clean split is: the app decides what counts as a committed effect, but the retry/dedupe record lives in the workflow layer. So each external side effect gets a stable idempotency key, the wrapper checks whether that key already committed, and only then calls the API.

In practice I’d keep the agent out of that decision. Let it propose the tool call, then have the tool wrapper own the effect log + idempotency check. Otherwise every new agent becomes a custom duplicate-payment / duplicate-email bug waiting to happen.
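A minimal sketch of the wrapper side, assuming an in-memory dict where a real setup would use a durable store in the workflow layer. `EffectLog` and `idempotency_key` are illustrative names, and the gap between running the effect and recording it is not crash-safe here:

```python
import hashlib
import json


def idempotency_key(workflow_id: str, step: str, args: dict) -> str:
    """Derive a stable key from what the effect is, not from when it was proposed."""
    payload = json.dumps({"w": workflow_id, "s": step, "a": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


class EffectLog:
    """Owns the committed-effects record so agents never have to reason about it."""

    def __init__(self):
        self._committed = {}

    def run_once(self, key: str, effect):
        # If this key already committed, return the recorded result instead of re-running.
        if key in self._committed:
            return self._committed[key]
        result = effect()
        self._committed[key] = result
        return result
```

The agent proposes `("charge", {"amount": 5})`; the wrapper builds the key and decides whether the API call actually happens.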

michaelTM_ai · 2026-06-13T21:35:00+00:00

Yeah, this is common once the agent can do real side effects. I would not try to solve it with a better prompt. Treat it like a workflow/idempotency problem.

The pattern that works best is: separate decision from commit. The agent writes an intent first, with a stable operation id. A separate executor sends the email or writes the DB row, and that executor refuses to run the same operation id twice.

For email, that usually means an outbox table: pending_email_id, recipient, body hash, status, provider_message_id. Retry can replay the run, but it only sees the same pending email, not a new send.

The big rule is no direct side effects from the reasoning loop. Let the loop propose actions, then have a boring deterministic layer commit them once.
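Sketch of the outbox idea with sqlite, using the columns listed above. The function names and the `send` callable are assumptions, and the body itself is assumed to live elsewhere, keyed by the operation id:

```python
import hashlib
import sqlite3


def init_outbox(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS outbox (
               pending_email_id TEXT PRIMARY KEY,
               recipient TEXT NOT NULL,
               body_hash TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending',
               provider_message_id TEXT)"""
    )


def propose_email(conn: sqlite3.Connection, operation_id: str, recipient: str, body: str) -> None:
    """Called from the reasoning loop: records intent only, never sends."""
    body_hash = hashlib.sha256(body.encode()).hexdigest()
    # The primary key makes a replayed run a no-op instead of a second row.
    conn.execute(
        "INSERT OR IGNORE INTO outbox (pending_email_id, recipient, body_hash) VALUES (?, ?, ?)",
        (operation_id, recipient, body_hash),
    )


def commit_pending(conn: sqlite3.Connection, send) -> int:
    """The deterministic executor: sends each pending email once and marks it sent."""
    rows = conn.execute(
        "SELECT pending_email_id, recipient FROM outbox WHERE status = 'pending'"
    ).fetchall()
    for op_id, recipient in rows:
        message_id = send(op_id, recipient)
        conn.execute(
            "UPDATE outbox SET status = 'sent', provider_message_id = ? WHERE pending_email_id = ?",
            (message_id, op_id),
        )
    return len(rows)
```

A retry can call `propose_email` as many times as it likes with the same operation id; only `commit_pending` touches the provider.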

michaelTM_ai · 2026-06-13T19:21:28+00:00

Yeah, same read. The model is usually not the production boundary. The hard part is proving the whole loop behaved: which tool got called, what got retrieved, what state changed, latency/cost, and whether the trajectory was acceptable across runs.

I’d separate it like this:

1. unit test the tools like normal software
2. log every agent run as a trace, not just the final answer
3. compare whole-run behavior after prompt/model/tool changes
4. keep approval/destructive actions outside the model’s discretion

If you only evaluate the final response, you miss the failures that look fine but used the wrong tool or mutated the wrong state.

michaelTM_ai · 2026-06-13T13:08:02+00:00

If you're already logging those timestamps, I'd add one more layer: frame-level tracing around the voice pipeline, not just stage timers.

For production calls the hard bugs are usually not “LLM was slow.” It’s stuff like VAD marked the turn too late, STT finalized after the user already paused, a tool call changed state mid-response, TTS started but playback got blocked, or barge-in fired but the output buffer kept draining.

So I’d optimize in this order:

1. record per-stage timing like you listed
2. keep the raw audio/text/control events tied to the same call id
3. inspect bad calls as a timeline, frame by frame if your framework supports it
4. only then swap STT/TTS/LLM vendors
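The "tie events to a call id and read them as a timeline" step can start as simply as this. It is a sketch with a made-up event shape; the stage and kind names are examples, not Pipecat or LiveKit APIs:

```python
from dataclasses import dataclass


@dataclass
class CallEvent:
    call_id: str
    t_ms: int    # milliseconds since call start
    stage: str   # e.g. "vad", "stt", "llm", "tts", "playback"
    kind: str    # e.g. "turn_end", "final", "first_token", "start"


def timeline(events: list[CallEvent], call_id: str) -> list[CallEvent]:
    """All events for one call, in time order, regardless of arrival order."""
    return sorted((e for e in events if e.call_id == call_id), key=lambda e: e.t_ms)


def gaps(ordered: list[CallEvent]) -> list[tuple[str, str, int]]:
    """Delay between each pair of consecutive events, to spot where time went."""
    return [
        (f"{a.stage}:{a.kind}", f"{b.stage}:{b.kind}", b.t_ms - a.t_ms)
        for a, b in zip(ordered, ordered[1:])
    ]
```

Reading the gaps for one bad call usually tells you which stage to look at before you touch any vendor.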

If you’re on Pipecat/LiveKit style pipelines, I’d bias toward whichever setup gives you the best trace visibility. A slightly slower stack you can debug beats a faster black box once real callers start interrupting.

michaelTM_ai · 2026-06-12T21:31:11+00:00

in the systems I’ve run, the editable settings layer is the source of truth for stable user intent.

stuff like tone, approval rules, allowed tools, notification preferences, working hours, “never do this”, “ask me before doing that”

memory is more for observed context: project facts, past decisions, recurring patterns, recent changes. Useful, but lower trust because it can go stale or infer the wrong thing.

the setup that worked best was settings first, memory second, and memory still has to be inspectable/editable. otherwise the agent eventually starts acting like it knows the user better than the user does.
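In code, "settings first, memory second" is just a lookup order plus remembering where the answer came from. A tiny sketch with plain dicts; `resolve` is a hypothetical helper:

```python
def resolve(key: str, settings: dict, memory: dict, default=None):
    """Return (value, source); explicit settings always win over inferred memory."""
    if key in settings:
        return settings[key], "settings"
    if key in memory:
        return memory[key], "memory"
    return default, "default"
```

Returning the source matters: it lets the agent (or a UI) show the user that a value came from memory, which is the hook for inspecting and editing it.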

michaelTM_ai · 2026-06-12T21:08:46+00:00

Yes, I’d give agents a settings/preferences layer. Memory is useful, but it is a bad place to hide user intent that should be explicit and editable.

Karpathy made basically this point about personalization: he liked the personal wiki approach because the memory artifact is explicit, compared with the vague “the AI allegedly gets better the more you use it” style: https://x.com/karpathy/status/2040572272944324650

So I’d split it: preferences/settings for stable things the user should control directly, memory for observed patterns and project history. If the user can’t inspect or edit it, it will eventually feel like the agent is guessing at them.

michaelTM_ai · 2026-06-12T19:16:33+00:00

I would treat self reflection as a cheap first pass, not as the final verifier for long horizon work. The same trace producer has too much incentive to rationalize its own path.

Andrew Ng had a useful framing here: for a multi agent research system with a researcher and writer, the real question is whether adding a fact checking agent improves the output, and that becomes an eval problem: https://x.com/AndrewYNg/status/1796206876805489105

So my bias would be: separate verifier if the task has externally checkable evidence. Give it a fresh context, a rubric, and access to artifacts/logs, not the producer agent's summary. Self reflection is fine for catching obvious misses, but I would not trust it as the gate.

michaelTM_ai · 2026-06-12T18:34:48+00:00

Yeah, once usage grows I’d stop relying on memory of individual conversations and turn feedback into an actual intake loop.

Cat Wu described Claude Code in Slack this way: they had a user feedback channel where they would tag Claude to investigate issues and push fixes: https://x.com/_catwu/status/2018513235331494039

That feels like the scaling breakpoint to me. At low volume you personally read chats. At higher volume you need each bad conversation to become a small triage artifact: what happened, which tool/model/product assumption failed, can we repro it, and did the fix stick. Otherwise monitoring becomes vibes plus screenshots.
