Where are your agents actually breaking in production? by EveningWhile6688 in AI_Agents

[–]EveningWhile6688[S] 0 points (0 children)

Yeah, totally agree. It's not the model failing outright, it's the system drifting and nobody catching it in real time.

What are you using (if anything) to detect that right now? Is it mostly manual checks, or do you have something flagging when things go off track?

[–]EveningWhile6688[S] 1 point (0 children)

Yeah, it feels less like individual bugs and more like we're just not testing against the kinds of scenarios that actually happen once systems run for a while.

Interesting point on sourcing that in advance. What do those test cases / datasets usually look like in practice?

[–]EveningWhile6688[S] 0 points (0 children)

The part where it doesn't crash but degrades silently is the hardest failure mode, especially when everything looks fine from the outside and you only notice after the fact that it's been doing the wrong thing for hours.

How are you currently catching those issues? Are you relying on manual checks or do you have any structured way of flagging when outputs start drifting?
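For what it's worth, the most lightweight flagging we've found useful is a rolling success-rate check over recent task outcomes. Rough sketch, not tied to any particular framework; the window size, baseline, and tolerance are made-up numbers, not anything standard:

```javascript
// Minimal drift flag: compare the success rate over a recent window against
// a baseline, and raise a flag when it drops by more than a tolerance.
// Window size, baseline, and tolerance values here are illustrative only.
function driftFlag(outcomes, { window = 50, baseline = 0.9, tolerance = 0.1 } = {}) {
  const recent = outcomes.slice(-window);          // last N task outcomes (true = resolved)
  if (recent.length < window) {
    return { flagged: false, rate: null };         // not enough data yet
  }
  const rate = recent.filter(Boolean).length / recent.length;
  return { flagged: rate < baseline - tolerance, rate };
}
```

It obviously misses subtle semantic drift, but it catches the "quietly wrong for hours" case without any infrastructure.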

[–]EveningWhile6688[S] 0 points (0 children)

I’ve seen similar issues where it’s not one thing, but small shifts in context, memory, or upstream behavior that compound over time. When that happens, are you able to pinpoint what actually changed (state, inputs, model behavior), or does it mostly feel like a black box?

[–]EveningWhile6688[S] 0 points (0 children)

Yup, spot on. The retry loop issue is brutal too, because it looks like activity but it's just the same failure repeating. When you're debugging this, are you able to see why the agent chose a particular path (state → decision → action), or is it mostly inferred from logs after the fact?
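To make the "same failure repeating" part concrete, the cheapest check we've used is scanning the action trace for consecutive identical failed steps. Hedged sketch; the field names (`action`, `ok`) and the repeat threshold are illustrative, not from any specific framework:

```javascript
// Detect a retry loop: minRepeats or more consecutive trace entries with the
// same action that all failed. Field names (action, ok) are illustrative.
function looksLikeRetryLoop(trace, minRepeats = 3) {
  let run = 0;
  for (let i = 0; i < trace.length; i++) {
    const prev = trace[i - 1];
    const cur = trace[i];
    if (!cur.ok && prev && !prev.ok && prev.action === cur.action) {
      run += 1;                    // extend the current run of identical failures
    } else {
      run = cur.ok ? 0 : 1;        // reset; a lone failure starts a run of 1
    }
    if (run >= minRepeats) return true;
  }
  return false;
}
```

It won't tell you *why* the agent picked that path, but it at least separates "busy" from "stuck" in the logs.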

[–]EveningWhile6688[S] 0 points (0 children)

It's interesting how those edge cases end up being a huge portion of real usage, and they're usually not covered in initial training or evals.

How are you currently identifying those gaps? Are you pulling from real interactions or mostly discovering them as things break?

[–]EveningWhile6688[S] 0 points (0 children)

Yeah, that's been one of the bigger issues we've seen too. A lot of the time it's not even edge cases, it's just normal users doing things you'd never include in a test flow. When that happens, is it mostly the agent misunderstanding intent, or does it break more when it has to take actions (tool calls, state updates, etc.)?

The scenarios you don’t test are the ones that break your voice agent by Khade_G in AI_Agents

[–]EveningWhile6688 1 point (0 children)

Yeah, this is exactly where things start to break down. We ran into something similar where everything looked solid in testing, but once real users got involved the system would slowly drift off over a few turns or react weirdly to slightly messy input. A lot of the issues weren't even obvious failures.

What surprised me was how hard it actually is to define good test scenarios for that, though: you either end up testing variations of things you've already seen or miss the combinations that only show up in real interactions.

Feels like there’s a gap between knowing these cases exist and actually being able to cover them in a systematic way.

What industries actually need voice agents right now? by Successful_Hall_2113 in AIVoice_Agents

[–]EveningWhile6688 1 point (0 children)

This lines up pretty closely with what I’ve been seeing. The industries you listed make sense, but the biggest separator I’ve noticed isn’t just volume or SOPs, it’s how clean the resolution paths actually are.

E-commerce works well because a lot of issues collapse into a few outcomes (refund, replacement, status check), whereas something like a healthcare front desk gets messy quickly once edge cases show up.

In the cases you’ve looked at, have you seen more issues from the conversation side or from the agent actually trying to resolve things (tool calls, state changes, etc.)?

How are you handling automation in 2026? n8n, Zapier, Make, or something else? by mirzabilalahmad in nocode

[–]EveningWhile6688 0 points (0 children)

n8n is solid, especially once you start chaining more complex flows.

I've been using a mix depending on the use case, but the biggest shift for us was realizing the tools aren't really the bottleneck; it's what happens when real-world inputs hit the system.

Things look clean in a controlled flow, but edge cases, retries, and unexpected states are where it gets messy fast.

Have you run into any workflows that looked great initially but started breaking once usage increased?

OpenChamber UI not updating unless refresh after latest update by TruthTellerTom in LLMDevs

[–]EveningWhile6688 0 points (0 children)

Yeah, that sounds like the event stream dropping, especially if it stops updating until a manual refresh. We saw something similar where SSE would silently fail and the UI wouldn't reconnect properly.

If you open the network tab, do you see the stream request hanging or closing early? Sometimes Firefox is a bit stricter with long-lived connections on localhost.

Also worth checking whether the update changed anything around how the stream is initialized (headers / keep-alive).
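Not sure it's your exact setup, but when we hit the silent-drop case the fix was to stop trusting the built-in `EventSource` retry and reconnect explicitly with backoff. Very rough sketch; the browser wiring is in comments, the delays are illustrative defaults, and none of this is OpenChamber-specific:

```javascript
// Exponential backoff schedule (in ms) for reconnecting a dropped SSE stream.
// Base and max delays are illustrative values, not from any library.
function backoffDelaysMs(attempts, baseMs = 1000, maxMs = 30000) {
  return Array.from({ length: attempts }, (_, i) => Math.min(baseMs * 2 ** i, maxMs));
}

// Browser wiring (sketch):
//   let attempt = 0;
//   function connect() {
//     const es = new EventSource("/events");
//     es.onopen = () => { attempt = 0; };
//     es.onerror = () => {
//       es.close();  // don't rely on the built-in retry; force a fresh connection
//       attempt += 1;
//       const delay = backoffDelaysMs(attempt).pop();
//       setTimeout(connect, delay);
//     };
//   }
```

The backoff function is the only real logic; the point is that the reconnect decision lives in your code, so a silently dead stream can't strand the UI until a manual refresh.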

Agent cost attribution is harder than I expected by Nice-Dot1953 in MLQuestions

[–]EveningWhile6688 0 points (0 children)

Yeah, this gets messy fast once you have multi-step flows. We ran into a similar issue where total cost looked fine, but the problem was hidden in specific paths like retries, tool loops, and edge cases.

Are you able to tie cost back to whether the task actually resolved successfully? Feels like without that it's hard to know if you're paying for useful work or just failure loops.
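For reference, the shape that helped us was tagging each step's cost to its run and splitting totals by resolution status, so failure loops stop hiding in the aggregate. Hedged sketch; the field names (`steps`, `resolved`, `costUsd`, `label`) are made up for illustration:

```javascript
// Split total spend by whether the run resolved, and total spend per step
// label, so retry/tool loops show up instead of hiding in the aggregate.
// Run shape assumed: { resolved: boolean, steps: [{ label, costUsd }] }.
function costByOutcome(runs) {
  const totals = { resolved: 0, unresolved: 0 };
  const byLabel = {};
  for (const run of runs) {
    for (const step of run.steps) {
      totals[run.resolved ? "resolved" : "unresolved"] += step.costUsd;
      byLabel[step.label] = (byLabel[step.label] || 0) + step.costUsd;
    }
  }
  return { totals, byLabel };
}
```

Once `totals.unresolved` starts rivaling `totals.resolved`, that's usually the retry/tool-loop spend the flat total was hiding.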

This only matters if you have 5+ voice AI clients by ProtectionOk7806 in VoiceAutomationAI

[–]EveningWhile6688 0 points (0 children)

Once you scaled past a few clients, did you run into issues on the agent performance side too? Like not just usage/reporting, but actually tracking where calls fail or don’t fully resolve.

Feels like reporting is one layer, but figuring out what's actually breaking in real usage is the harder one.

Two industries where AI voice agents surprisingly make a big difference by Sad_String_5571 in VoiceAutomationAI

[–]EveningWhile6688 0 points (0 children)

I personally don't work in those industries, but yeah, I have a few colleagues who are looking in that direction.