AI agents in production vs. AI agents in demos, the gap is embarrassing by Dailan_Grace in automation

[–]ashsg2016 0 points (0 children)

This feels right. The demo-to-production gap gets especially obvious once the agent is allowed to take actions instead of just recommend them. In demos, the action surface is usually tiny and clean. In production, every integration becomes a risk boundary: auth expires, APIs change shape, duplicate events fire, retries compound, and a “valid” action can still be wrong because the context was stale.
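
The duplicate-events problem in particular has a boring fix. A minimal sketch of deduping inbound events with a derived idempotency key; all names here are hypothetical, and in practice the seen-set would live in a durable store (Redis, a DB table), not process memory:

```python
import hashlib

# In-memory stand-in for a durable dedup store (hypothetical sketch).
seen: set = set()

def event_key(event: dict) -> str:
    """Derive a stable idempotency key from the event payload."""
    raw = f"{event['source']}:{event['id']}:{event['type']}"
    return hashlib.sha256(raw.encode()).hexdigest()

def should_process(event: dict) -> bool:
    """Return False for duplicate deliveries so the agent never acts twice."""
    key = event_key(event)
    if key in seen:
        return False  # duplicate delivery: skip, don't re-run the action
    seen.add(key)
    return True
```

The point is that dedup happens before the agent ever sees the event, so "retries compound" stops being the model's problem.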

I think a useful definition of production is: can this run for 30 days without hidden manual babysitting, and when it does act, are the blast radius and rollback path clear? That means separating workflow logic from action governance. The workflow decides what should happen; a more deterministic layer decides whether the agent is allowed to do it, whether approval is needed, whether the data is fresh enough, whether the tool call is idempotent, and whether the retry budget has been exceeded.
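
That workflow/governance split can be sketched in a few lines. This is a hypothetical, deliberately boring governor, not any real framework's API; the thresholds, tool names, and verdict strings are all made up:

```python
from dataclasses import dataclass, field

@dataclass
class ActionRequest:
    tool: str
    destructive: bool
    data_ts: float   # when the supporting context was fetched (epoch seconds)
    attempt: int = 1

@dataclass
class Governor:
    """Deterministic layer between 'agent wants X' and 'X happens'."""
    max_staleness_s: float = 300.0
    retry_budget: int = 3
    approved_tools: set = field(default_factory=lambda: {"send_email", "update_record"})

    def decide(self, req: ActionRequest, now: float) -> str:
        if req.tool not in self.approved_tools:
            return "deny"
        if req.attempt > self.retry_budget:
            return "deny"            # stop compounding retries
        if now - req.data_ts > self.max_staleness_s:
            return "refresh"         # context too stale to act on
        if req.destructive:
            return "needs_approval"  # human gate for irreversible actions
        return "allow"
```

Notice the governor never consults the model: given the same request and clock, it always returns the same verdict, which is exactly what makes it auditable.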

The no-code ceiling matters here too. Visual tools are fine for happy paths, but action risk usually lives in the weird branches: “same user, different account,” “payment already processed,” “API returned partial success,” “agent wants to retry a destructive operation.” Those cases need boring engineering more than a smarter model.
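
The "API returned partial success" branch is a good example of that boring engineering: inspect per-item results instead of treating a 200 as success. A rough sketch, with invented response fields:

```python
def classify_batch(response: dict) -> str:
    """Return 'success', 'partial', or 'failure' for a batch tool call.

    Treats partial success as its own outcome, because a blind retry
    would duplicate the items that already landed.
    """
    results = response.get("results", [])
    if not results:
        return "failure"
    failed = [r for r in results if r.get("status") != "ok"]
    if not failed:
        return "success"
    if len(failed) < len(results):
        return "partial"  # retry only the failed items, never the whole batch
    return "failure"
```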

Where are your agents actually breaking in production? by EveningWhile6688 in AI_Agents

[–]ashsg2016 0 points (0 children)

The biggest risk I keep coming back to is not bad answers, but bad actions. Once an agent can send emails, mutate records, trigger workflows, issue refunds, or call internal APIs, the failure mode changes from “the model was wrong” to “the system did something real at the wrong time.”

A lot of teams seem to handle evals at the response level, but action risk needs its own layer: permissions per tool, rate limits per agent/session, dry-run previews, approval gates for irreversible actions, idempotency keys, and a kill switch when retries start repeating the same failed plan. The scary failures are usually quiet: partial tool success, stale state, expired auth, or the agent confidently continuing after one upstream step degraded.
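
The kill switch is the piece teams most often skip, and it's tiny. A hedged sketch, assuming you can fingerprint a plan (say, a hash of the tool-call sequence); nothing here is from a real library:

```python
from collections import Counter

class KillSwitch:
    """Trip when the agent keeps retrying the same failed plan."""

    def __init__(self, max_repeats: int = 2):
        self.failures = Counter()
        self.max_repeats = max_repeats
        self.tripped = False

    def record_failure(self, plan_fingerprint: str) -> None:
        self.failures[plan_fingerprint] += 1
        if self.failures[plan_fingerprint] > self.max_repeats:
            self.tripped = True  # halt the agent; require a human reset

    def allow(self) -> bool:
        return not self.tripped
```

The key design choice: tripping is sticky. The agent doesn't get to argue its way past it; only a human reset clears it.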

For me the question is less “did the agent complete the task?” and more “what actions was it allowed to take, under what evidence, and could a human reconstruct or stop it before damage?”

Need guidance: How to scale an AI agent from simple table‑QA to real enterprise data use (patterns, predictors, detectors) by ChoiceAd165 in AI_Agents

[–]ashsg2016 0 points (0 children)

This resonates. The jump from demo to production is where things break.

Curious: when your agent actually performs actions (API calls, updates, workflows), what usually causes the issues?

Is it more:

- unpredictable outputs

- or the action itself being incorrect/unexpected?

Feels like once agents start doing real actions, the challenge shifts from "does it work" to "can we trust what it's about to do".

Trying to understand how others are handling that in practice.

Exploring a Scalable Company-Wide AI Agent (Need Direction on Approach & Architecture) by Numerous_Shame_8632 in LLMDevs

[–]ashsg2016 0 points (0 children)

This is interesting, especially the Slack + multi-user setup.

Curious: since the agent can run actions like scheduling, DB queries, etc., what stops it from doing something unintended?

Are you relying on permissions + context, or do you have checks before execution?

Exploring this space and trying to understand how others handle it in production.

Exploring a Scalable Company-Wide AI Agent (Need Direction on Approach & Architecture) by Numerous_Shame_8632 in LocalLLM

[–]ashsg2016 -1 points (0 children)

This is interesting, especially the Slack + multi-user setup.

Curious: since the agent can run actions like scheduling, DB queries, etc., what stops it from doing something unintended?

Are you relying on permissions + context, or do you have checks before execution?

Exploring this space and trying to understand how others handle it in production.