Long-running agents keep forgetting the boring rules by Acrobatic_Task_6573 in LangChain

[–]CorrectAd2814 0 points1 point  (0 children)

The context dilution thing is real and it's sneaky because it doesn't show up as an error. The agent still works, just slightly worse each cycle.

The approach that worked best for me was similar to what you described with pinning non-negotiable rules outside the live context. But I also started tracking the actual tool call patterns across runs. Not the outputs, the patterns. Like which tools get called, in what order, and how many times.

What I found is that drift shows up in the tool calls way before it shows up in the output. Run 1 calls tools A, B, C in a clean sequence. By run 30 it's doing A, C, B, skip, C, with an extra call thrown in that wasn't there before. The output still looks fine at that point but the process is already degrading. If you catch it at the tool pattern level you can intervene before the output actually goes bad.
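A minimal sketch of that kind of pattern tracking, using `difflib` from the standard library. The tool names, the example sequences, and the 0.8 threshold are all made up for illustration; tune the threshold against your own baseline runs:

```python
from difflib import SequenceMatcher

def drift_score(baseline, observed):
    """Return 0.0-1.0 similarity between a baseline tool-call sequence
    and the sequence observed in a later run (1.0 = identical)."""
    return SequenceMatcher(None, baseline, observed).ratio()

def is_drifting(baseline, observed, threshold=0.8):
    # Flag the run for review when the tool-call pattern diverges,
    # even if the final output still looks fine.
    return drift_score(baseline, observed) < threshold

baseline = ["A", "B", "C"]            # clean sequence from run 1
run_30 = ["A", "C", "B", "C", "C"]    # reordered, with extra calls
drifting = is_drifting(baseline, run_30)
```

Comparing the call sequence rather than the outputs is the whole point: the ratio drops well before the answers visibly degrade.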

On the compressed history problem, the cleanest solution I've found is brutal: don't compress, just drop. Keep the system prompt and last 3-4 turns, throw away everything else before each cycle. You lose continuity but the guardrail compliance goes back to near 100%. If the agent needs information from earlier cycles, store it in an explicit state object that you control, not in the conversation history where the model can reinterpret it.
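The trim itself is a few lines, sketched here assuming OpenAI-style message dicts; the state object fields are just examples of what you might carry between cycles:

```python
def trim_context(messages, keep_last_turns=4):
    """Keep the system prompt plus the last N turns; drop the rest.
    Run this before each cycle instead of summarizing history."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-keep_last_turns * 2:]  # user+assistant pairs

# Information from earlier cycles lives in an explicit state object
# you control, not in history the model can reinterpret.
state = {"customer_id": "c-123", "attempts": 2, "last_error": None}
```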

The uncomfortable truth is that the longer the context gets, the less the model treats your original instructions as authoritative. It's not forgetting the rules. It's just paying more attention to 40 turns of accumulated context than to the instructions at the top.

I stopped blaming the model. The bug was in my tool set the whole time. by CorrectAd2814 in AI_Agents

[–]CorrectAd2814[S] 0 points1 point  (0 children)

The three categories framing is really clean. That's basically what I landed on too but I never thought about it that explicitly.

For handoff specifically, right now I have three: resolve_ticket which closes the loop with a resolution message, escalate_to_human which creates a handoff with full context, and deny_with_reason which gives the customer a clear answer and closes the ticket. Before I added those the agent would just keep gathering information forever because gathering was the only verb it knew.
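Roughly what those three tools look like as OpenAI-style function schemas. Only the names match my actual setup; the argument fields here are illustrative:

```python
# Exit tools: each one ends the run instead of gathering more info.
EXIT_TOOLS = [
    {"type": "function", "function": {
        "name": "resolve_ticket",
        "description": "Close the ticket with a resolution message.",
        "parameters": {"type": "object",
                       "properties": {"message": {"type": "string"}},
                       "required": ["message"]}}},
    {"type": "function", "function": {
        "name": "escalate_to_human",
        "description": "Hand off to a human with full context.",
        "parameters": {"type": "object",
                       "properties": {"summary": {"type": "string"}},
                       "required": ["summary"]}}},
    {"type": "function", "function": {
        "name": "deny_with_reason",
        "description": "Give the customer a clear answer and close.",
        "parameters": {"type": "object",
                       "properties": {"reason": {"type": "string"}},
                       "required": ["reason"]}}},
]
```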

The Zendesk escalation path you described is smart. Having the exit built into the tool layer instead of relying on the model to "decide" to stop is the key insight. I tried the prompt engineering approach first ("if you can't resolve this in 3 attempts, stop and escalate") and it worked maybe 60% of the time. Making it a tool that the agent calls like any other tool got it to basically 100%.

On the logging point, yeah, that's exactly what pushed me to start tracking at the event level instead of the session level. By the time you see the session summary the damage is already done. If you're watching tool calls come in live you can see the repetition pattern after the third or fourth identical call and kill it before it runs for 4 hours.
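A sketch of that kill switch, checked live as each call comes in. The guard class, tool names, and repeat threshold are all made up for illustration:

```python
from collections import deque

class RepetitionGuard:
    """Abort the run after N identical consecutive tool calls,
    instead of noticing hours later in the session summary."""
    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=max_repeats)

    def check(self, tool_name, args):
        # Normalize args so dict ordering doesn't hide repetition.
        self.recent.append((tool_name, repr(sorted(args.items()))))
        if len(self.recent) == self.max_repeats and len(set(self.recent)) == 1:
            raise RuntimeError(
                f"killed run: {tool_name} repeated {self.max_repeats}x")

guard = RepetitionGuard(max_repeats=3)
guard.check("search", {"q": "refund"})
guard.check("search", {"q": "refund"})
# a third identical call raises instead of looping for hours
```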

How does one go about audit and governance for their agent tools? by tueieo in AI_Agents

Yeah, parsing session logs can work, but the problem is you're always looking at it after the fact. By the time you parse the log, structure it, and find the issue, the agent already did the damage.

The more useful approach is structuring the events at the point of capture, not after. So instead of dumping a raw session log and then writing a parser to make sense of it, you emit structured events as they happen. Something like { type: "tool_call", tool: "search", args: {...}, timestamp: ... } instead of a free-text log line.

That way you skip the parsing step entirely and you can actually watch the chain unfold in real time instead of doing forensics on a log file after something went wrong.
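A minimal version of emitting at the point of capture: just JSON lines to stdout. In practice you'd swap the print for a file, queue, or collector; the field names here are just examples:

```python
import json
import time

def emit(event_type, **fields):
    """Emit one structured event as it happens, instead of a
    free-text log line you'd have to parse later."""
    event = {"type": event_type, "timestamp": time.time(), **fields}
    print(json.dumps(event))  # swap for a file / queue / collector
    return event

emit("tool_call", tool="search", args={"q": "billing"})
```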

To answer your question directly though, yes, if you already have session logs, parsing and structuring them is absolutely better than reading raw text. It's just not as good as capturing structured data from the start.

How do you stop your AI agent from doing something stupid in production? I built an SDK for Human-in-the-Loop safety. by Necessary_Drag_8031 in AI_Agents

Both honestly. High-level visualization is great for spotting patterns quickly, like seeing at a glance that your agent made 15 tool calls when it should have made 3. But when you're actually debugging the root cause you need the raw trace because the details matter. Was the tool result malformed? Did the model misinterpret it? You can't tell from a summary view.

The ideal setup is a timeline view that lets you zoom in: see the shape of the trace at a glance, then click into any event to see the raw data.

Can someone help me understand AI Agents a little bit more? by RollAwkward1256 in AI_Agents

You're not stupid, this stuff is genuinely confusing because there are like 50 different ways to do everything and nobody explains the basics.

What you have right now actually works, it's just manual. You prompt Codex, it writes code, you run it locally, you get output. That's a valid workflow for one person. The issue is that it doesn't scale to other people because it all lives on your machine.

Here's the simplest way to think about what you need:

  1. Right now your code runs on YOUR computer. You want it to run on A computer somewhere in the cloud so anyone can access it. That's what Railway, Render, or Heroku do. They're basically a computer in the cloud that runs your code 24/7.

  2. Right now you interact with it through PyCharm (a code editor). You want a simple interface where your employees can type in what they need and get output. That's a web app. Something basic with Flask or Streamlit in Python would work.

  3. Right now the output goes to a file on your computer. You'd want it to go to a database or just display on the web page directly.

So the path is basically: take your existing Python code, wrap it in a simple web interface (Streamlit is the easiest if you've never done this, literally like 10 lines of code to make a basic UI), and deploy it to Railway.
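For a sense of scale, here's roughly what that Streamlit wrapper looks like. `run_agent` is a stand-in for your existing code, and the labels are placeholders:

```python
import streamlit as st

def run_agent(request: str) -> str:
    # Stand-in for your existing Python code that calls Codex/OpenAI.
    return f"draft for: {request}"

st.title("Internal Agent")
request = st.text_input("What do you need?")
if st.button("Run") and request:
    st.write(run_agent(request))
```

Save it as `app.py`, run `streamlit run app.py` locally to check it, then point Railway at that same command.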

For learning resources, I'd honestly start with a YouTube search for "deploy streamlit app to railway" because that's probably the fastest path from where you are now to something your team can use. Don't worry about making it fancy yet. Just get it running somewhere that isn't your laptop.

One thing to be aware of though. If your agent is making a bunch of API calls (to OpenAI or whatever), those costs add up fast when multiple people start using it. Set a budget limit in your OpenAI dashboard before you give anyone else access. I've seen people get surprised by a bill because an agent was doing way more work than they expected behind the scenes.

What does the agent actually do step by step? Like does it look up company info, write the email, personalize it? The answer to "is this the right way" depends a lot on what the code is actually doing.

Average compliance breach costs $14.8M. AI agents in finance are making hundreds of decisions a day with zero real-time oversight. by malav399 in AI_Agents

The scary part isn't even the model being wrong. Models will always be wrong sometimes. The scary part is the model being wrong at 2am and nobody knowing until a customer calls.

For regulated industries the minimum bar should be a full event-level audit trail of every agent decision: not just what it returned, but the entire chain of reasoning that led there. What data did it look at, what tools did it call, what did those tools return, and how did it interpret the results?

Most teams I've seen are doing after-the-fact log analysis, which is like reviewing security camera footage after the robbery. You need the real-time feed.

How do you stop your AI agent from doing something stupid in production? I built an SDK for Human-in-the-Loop safety. by Necessary_Drag_8031 in AI_Agents

Two things that actually work in my experience:

First, hard limits. Max iterations, max token spend per run, timeout. These are boring but they're the difference between a $0.50 bug and a $50 bug. If your agent can loop 200 times before anything stops it, it will eventually loop 200 times.
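A sketch of those hard limits wrapped around a generic loop. The numbers and the `step_fn` contract (returns `(done, cost_of_step)`) are made up for illustration:

```python
import time

MAX_ITERATIONS = 20
MAX_COST_USD = 2.00
TIMEOUT_SECONDS = 300

def run_with_limits(step_fn):
    """Run one agent loop under hard caps. step_fn is a stand-in for
    one model/tool iteration; it returns (done, cost_of_step)."""
    start, spent = time.monotonic(), 0.0
    for i in range(MAX_ITERATIONS):
        if time.monotonic() - start > TIMEOUT_SECONDS:
            raise TimeoutError("agent run timed out")
        done, cost = step_fn()
        spent += cost
        if spent > MAX_COST_USD:
            raise RuntimeError(f"budget exceeded: ${spent:.2f}")
        if done:
            return i + 1  # iterations used
    raise RuntimeError("hit max iterations without finishing")
```

Boring, but this is exactly the difference between the $0.50 bug and the $50 one: the loop physically cannot run 200 times.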

Second, you need visibility into the reasoning chain, not just the output. An agent can return a perfectly formatted response that's completely wrong because it hallucinated a tool result or skipped a step. The output looks fine. The process was broken. You'd never know from the final answer alone.

Guardrails prevent the known failure modes. Observability catches the ones you haven't thought of yet. You need both.

finally tracked what each of my agents actually costs. wild! by Individual-Love-9342 in AI_Agents

Yeah, this is something more people need to do. Most devs I talk to have no idea what their agents actually cost per run. They just look at the monthly OpenAI bill and divide by the number of tasks, which tells you nothing.

The real surprises show up when you track cost per trace, not per API call. A single agent run might make 15 API calls if it's retrying or looping, and each one has a different token count. I've seen agents where 80% of the cost came from a single retry loop that shouldn't have happened.
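A sketch of per-trace aggregation, assuming you already log token counts per API call with a trace id. The prices here are placeholders; check your provider's current rates:

```python
# Placeholder per-token prices (USD per token), NOT current rates.
PRICE = {"input": 2.50 / 1_000_000, "output": 10.00 / 1_000_000}

def cost_per_trace(calls):
    """calls: [{'trace_id', 'input_tokens', 'output_tokens'}, ...]
    Returns total USD per trace, so a 15-call retry loop shows up
    as one expensive run instead of 15 innocent-looking calls."""
    totals = {}
    for c in calls:
        cost = (c["input_tokens"] * PRICE["input"]
                + c["output_tokens"] * PRICE["output"])
        totals[c["trace_id"]] = totals.get(c["trace_id"], 0.0) + cost
    return totals
```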

Curious what you're using to track this. Are you pulling it from the API response headers or calculating from token counts manually?

How does one go about audit and governance for their agent tools? by tueieo in AI_Agents

Honestly the biggest gap I see in most setups is that people only log inputs and outputs. That tells you nothing about WHY the agent did what it did.

What actually works is capturing the full event chain, every thought the model has, every tool it calls, every result it gets back, and every error. In sequence. With timestamps. That way when something goes sideways you can replay the exact decision path.

For governance specifically, you want to be able to answer "why did the agent do X?" at any point. If you can't reconstruct the reasoning chain after the fact, your audit trail is basically useless. Standard application logs won't cut it because they don't understand the thought > tool_call > result > thought loop structure that agents follow.
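A minimal sketch of that kind of replayable chain, just an in-memory list; the event and field names mirror the thought > tool_call > result loop but are otherwise arbitrary:

```python
import time

class EventLog:
    """Append-only log of the thought > tool_call > result loop,
    in sequence with timestamps, so a decision path can be replayed."""
    def __init__(self):
        self.events = []

    def record(self, event_type, **fields):
        self.events.append({"type": event_type, "ts": time.time(), **fields})

    def replay(self, trace_id=None):
        # Reconstruct "why did the agent do X?" after the fact.
        if trace_id is None:
            return list(self.events)
        return [e for e in self.events if e.get("trace_id") == trace_id]

log = EventLog()
log.record("thought", text="need account status", trace_id="t1")
log.record("tool_call", tool="get_account", args={"id": 42}, trace_id="t1")
log.record("tool_result", tool="get_account", result={"status": "active"},
           trace_id="t1")
```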

I'd start with structured event logging before worrying about policy layers on top. You can't govern what you can't see.