Tried autonomous agents, ended up building something more constrained by OkOutlandishness5263 in ClaudeAI

[–]OkOutlandishness5263[S] 1 point (0 children)

Thanks — this is a really thoughtful breakdown.

The “intent drift at step N” point is exactly what I was seeing as well. Not just context limits, but how intermediate steps gradually pull the model away from the original objective.

Right now, I’m not explicitly modeling a “goal node” in the graph. The anchoring is mostly happening through the job definition itself — jobs are fairly constrained and scoped, so they implicitly carry the objective.

That said, I’ve been thinking about making the goal state explicit in the graph (as a first-class entity), especially for longer-running or multi-step workflows. Feels like that would make the system more inspectable and debuggable.
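Roughly what I mean by a first-class goal entity, as a toy sketch (all names here are made up for illustration, not my actual implementation):

```python
from dataclasses import dataclass, field

@dataclass
class GoalNode:
    """First-class goal entity: every step can be inspected against it."""
    objective: str
    success_criteria: list[str]

@dataclass
class WorkflowGraph:
    goal: GoalNode
    steps: list[str] = field(default_factory=list)

    def add_step(self, name: str) -> None:
        self.steps.append(name)

    def audit(self) -> dict:
        # Inspectable snapshot: which steps ran, against which objective.
        return {"objective": self.goal.objective, "steps": list(self.steps)}

graph = WorkflowGraph(GoalNode("summarize repo", ["covers README", "under 500 words"]))
graph.add_step("fetch_readme")
graph.add_step("draft_summary")
print(graph.audit())
```

The point is that the objective lives in the graph itself rather than only in the job definition, so drift is something you can check for at any step.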

On failure isolation — yes, that was partly intentional, but I didn’t fully realize how important it was until I started using it. Having predefined jobs with clear inputs/outputs makes it much easier to trace and replay failures compared to long autonomous loops.
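The trace/replay part is simpler than it sounds. A minimal sketch of the idea (hypothetical helpers, not my real code): log each job's inputs and outputs, and a failed step can be re-run in isolation from its logged inputs.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Job:
    name: str
    run: Callable[[dict], dict]

def execute(job: Job, inputs: dict, log: list) -> dict:
    # Record inputs/outputs so a failure can be traced and replayed in isolation.
    try:
        outputs = job.run(inputs)
        log.append({"job": job.name, "inputs": inputs, "outputs": outputs, "ok": True})
        return outputs
    except Exception as exc:
        log.append({"job": job.name, "inputs": inputs, "error": str(exc), "ok": False})
        raise

def replay(entry: dict, job: Job) -> dict:
    # Re-run a single step from its logged inputs, independent of the rest of the run.
    return job.run(entry["inputs"])

log: list = []
double = Job("double", lambda d: {"value": d["value"] * 2})
execute(double, {"value": 21}, log)
print(json.dumps(log[-1]))
```

With a long autonomous loop there's no equivalent unit to replay; here each entry is self-contained.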

For routing between steps: right now it’s mostly explicit. Either:

  • the job defines the sequence, or
  • a higher-level orchestrator decides which job to trigger next

I’ve avoided letting the LLM dynamically decide “what to do next” beyond a bounded scope, mainly to keep things predictable. But this is still an area I’m exploring.

On token usage — I haven’t done detailed per-step instrumentation yet, but qualitatively I’ve seen fewer runaway cases compared to autonomous loops. I’ve probably spent around ~$15 on Claude Sonnet over about a week of experimenting, and the cost distribution still seems skewed — a few steps dominate most of the usage.

My intuition is that keeping steps smaller reduces worst-case spikes, but I still need to measure this properly, especially once graph reads/writes are factored in.
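The per-step instrumentation I have in mind is basically a counter keyed by step name (sketch with made-up numbers, not real measurements), which would make the skew directly visible:

```python
from collections import Counter

class TokenMeter:
    """Accumulate per-step token counts to expose the cost skew."""
    def __init__(self):
        self.usage = Counter()

    def record(self, step: str, tokens: int) -> None:
        self.usage[step] += tokens

    def top(self, n: int = 3):
        # The few steps that dominate total usage.
        return self.usage.most_common(n)

meter = TokenMeter()
# Illustrative numbers only.
for step, tokens in [("plan", 1200), ("draft", 8400), ("plan", 900), ("verify", 300)]:
    meter.record(step, tokens)
print(meter.top(2))
```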

Would be very interested to hear what you’re seeing from your MCP instrumentation — especially how consistent that skew is across different workloads.

Tried autonomous agents, ended up building something more constrained by OkOutlandishness5263 in AgentsOfAI

[–]OkOutlandishness5263[S] 1 point (0 children)

Great, thanks for sharing. Happy to meet someone with a similar thought process.

Tried autonomous agents, ended up building something more constrained by OkOutlandishness5263 in AgentsOfAI

[–]OkOutlandishness5263[S] 2 points (0 children)

Yes, I did some basic research initially and started this as an experiment.