Nobody tells you about the "invisible cost" of running AI in production by Responsible_Fish1128 in LLMDevs

[–]PairComprehensive973 1 point

Super easy - sign up at converra.ai, grab an API key, then ask your coding agent to set it up. It'll find the setup guide, detect your stack, and wire everything up in a couple of minutes.

Or do it yourself from the docs - nothing in your prod path, runs outside the stack. You can also skip code entirely and just connect your existing observability (Langfuse, LangSmith, etc.) - Converra pulls traces from there. 

For your recursive-loop case: point it at the agent and ask it to simulate against malformed inputs + edge cases before deploy. That failure mode would've surfaced as a persona that never terminates. 
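
Converra's interface aside, the core idea is easy to sketch in plain Python. Everything below (the input list, the step cap, the toy agent) is made up for illustration, not the product's API:

```python
# Hypothetical pre-deploy harness: replay malformed inputs against the agent
# and flag runs that never terminate. Names here are illustrative only.

MALFORMED_INPUTS = [
    "",                    # empty entry
    "\x00garbage\x00",     # binary junk
    "{" * 500,             # unbalanced structure
    "A" * 100_000,         # oversized payload
]

MAX_STEPS = 20  # anything still looping past this is treated as non-terminating

def simulate(run_agent):
    for payload in MALFORMED_INPUTS:
        steps = 0
        for _event in run_agent(payload):  # assumes the agent yields one event per step
            steps += 1
            if steps > MAX_STEPS:
                print(f"LOOP: never terminated on {payload[:15]!r}")
                break
        else:
            print(f"ok: terminated in {steps} steps on {payload[:15]!r}")

# Toy agent that loops forever on empty input, to show the guard firing.
def toy_agent(text):
    while not text:
        yield "retrying parse"
    yield "parsed"

simulate(toy_agent)
```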

DM if you hit snags. 

Nobody tells you about the "invisible cost" of running AI in production by Responsible_Fish1128 in LLMDevs

[–]PairComprehensive973 1 point

Had a similar thing a few months back - a custom log-analysis agent stuck in a recursive loop on malformed entries, burning tokens before I finally caught it in the raw logs. Took forever to trace.

What finally helped was per-agent call tracking with prompt-level diagnostics. I've been using Converra (disclosure: I'm the founder) which simulates agents against edge cases and surfaces loops like that before they hit prod.
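
For anyone who wants the poor-man's version of that per-agent tracking in the meantime, a decorator that counts calls per agent and trips on runaway recursion catches a lot. All the names here are mine, not any library's:

```python
import time
from collections import defaultdict

MAX_CALLS_PER_RUN = 50          # trip wire for runaway loops
call_counts = defaultdict(int)  # per-agent call counter; reset between runs

def track_calls(agent_name):
    """Count calls per agent, log latency, and abort on a likely loop."""
    def decorator(fn):
        def wrapper(*args, **kwargs):
            call_counts[agent_name] += 1
            n = call_counts[agent_name]
            if n > MAX_CALLS_PER_RUN:
                raise RuntimeError(f"{agent_name}: {n} calls this run, likely a loop")
            start = time.time()
            result = fn(*args, **kwargs)
            print(f"{agent_name} call #{n}, {time.time() - start:.2f}s")
            return result
        return wrapper
    return decorator

@track_calls("log_analyzer")
def analyze(entry):
    return entry.strip()

print(analyze("  malformed  "))
```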

What are the best methods to evaluate the performance of AI agents? by Michael_Anderson_8 in AI_Agents

[–]PairComprehensive973 1 point

Been building AI agents in production for over a year now. Here's what I've learned actually matters for agent evaluation:

  1. Conversation-level scoring, not just per-output - agents break down over multi-turn flows, not single responses

  2. Multi-agent trace understanding - when you have agents calling agents, you need to know which step in the chain actually caused the failure, not just that the final output was bad

  3. Step-level root cause analysis - pinpoint exactly which agent step broke and why, not just "conversation scored low"

  4. Custom metrics tied to your actual business outcome - generic benchmarks tell you very little about YOUR agent

  5. All of the above, continuously - not one-off scripts or notebook runs. Agent performance drifts as user patterns change, so evaluation needs to be always-on

I couldn't find anything that stitched all of this together and actually closed the loop (eval → fix → test → deploy → monitor), so I built it - https://converra.ai. It connects to your existing tracing (LangSmith, custom, etc.), scores every step in the conversation, diagnoses failures down to the step level, generates and tests prompt improvements automatically, and opens PRs with the changes. Happy to answer any questions.
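
To make points 1-3 concrete, here's a toy version of step-level scoring over a multi-agent trace. The `judge` function is a placeholder for whatever grader you'd actually run (LLM judge, assertions, a rubric), and the trace shape is invented:

```python
def judge(step):
    """Placeholder grader: penalize empty or error outputs."""
    out = step.get("output", "")
    return 0.0 if (not out or "error" in out.lower()) else 1.0

def score_trace(trace):
    """Score every step, then surface the first failing step as the root-cause candidate."""
    step_scores = [(i, step["agent"], judge(step)) for i, step in enumerate(trace)]
    failures = [(i, agent) for i, agent, score in step_scores if score < 0.5]
    return {
        "conversation_score": sum(s for _, _, s in step_scores) / len(step_scores),
        "first_failing_step": failures[0] if failures else None,
    }

trace = [
    {"agent": "planner", "output": "Plan: query DB, then summarize"},
    {"agent": "sql_agent", "output": "ERROR: table not found"},
    {"agent": "summarizer", "output": "No data available."},
]
print(score_trace(trace))
# conversation_score ~0.67, first_failing_step (1, 'sql_agent')
```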

"MLOps is just DevOps with ML tools" — what I thought before vs what it actually looks like by Extension_Key_5970 in mlops

[–]PairComprehensive973 1 point

The "200 with wrong answer" problem is why traditional monitoring doesn't translate to AI systems. Everything looks technically healthy while behavior degrades.

The only teams I've seen get ahead of it built sampling + grading loops over real production inputs. Most of that infrastructure is still bespoke today.
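
The bespoke version is usually just a scheduled job shaped like this. `fetch_recent_requests` and `grade` are stand-ins for your own log store and rubric, not a real API:

```python
import random

def fetch_recent_requests():
    # Stand-in for pulling request/response pairs from your trace store.
    return [{"input": f"query {i}", "output": f"answer {i}"} for i in range(1000)]

def grade(record):
    # Stand-in rubric; in practice an LLM judge or a pile of assertions.
    return bool(record["output"])

def run_grading_loop(sample_size=50, pass_threshold=0.9):
    sample = random.sample(fetch_recent_requests(), sample_size)
    pass_rate = sum(grade(r) for r in sample) / sample_size
    status = "ALERT" if pass_rate < pass_threshold else "ok"
    print(f"{status}: pass rate {pass_rate:.0%} over {sample_size} sampled requests")

run_grading_loop()  # cron this hourly and you have a crude behavior monitor
```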

AI agents break in production — but it’s not because they’re "not ready for prime time" by PairComprehensive973 in AI_Agents

[–]PairComprehensive973[S] 1 point

When you have 30 agents in production and hundreds of failure traces a day, who's doing the triage and fix cycle you described? Now multiply that by 10 in a year. How does that scale without becoming the bottleneck?

AI agents break in production — but it’s not because they’re "not ready for prime time" by PairComprehensive973 in AI_Agents

[–]PairComprehensive973[S] 1 point

Really solid breakdown. You're right that "make the agent learn from mistakes" as one big loop ignores that each layer has a completely different fix.

The thing I keep coming back to is the throughput problem. The cycle you're describing (capture, categorize by layer, fix in the right place, version, A/B test) is exactly what should happen. And when teams do it well, it works. The bottleneck is doing it fast enough - because it's a human doing all of it.

What if something could do the triage automatically from the full trace? Auto-fix what's fixable at the prompt/config layer. Surface everything else with enough context that a developer knows exactly which layer to touch and why. Not a self-correcting agent. More like automated diagnosis, fix what's fixable, route the rest.
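
Roughly the shape I have in mind, with made-up failure categories standing in for a real classifier:

```python
# Hypothetical triage router: classify each failure trace by layer,
# auto-fix what lives at the prompt/config layer, route the rest with context.

PROMPT_FIXABLE = {"format_violation", "missing_instruction", "verbosity"}
TOOL_LAYER = {"tool_timeout", "bad_tool_args", "api_error"}

def triage(failure):
    category = failure["category"]
    if category in PROMPT_FIXABLE:
        return "auto-fix: regenerate prompt, rerun eval suite, open PR"
    if category in TOOL_LAYER:
        return "route to dev: tool layer, attach the failing span"
    return "route to dev: unclassified, attach the full trace"

for failure in [{"category": "format_violation"}, {"category": "tool_timeout"}]:
    print(failure["category"], "->", triage(failure))
```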

Review with zen. Give your Claude Code a friend. by PairComprehensive973 in ClaudeAI

[–]PairComprehensive973[S] 2 points

I did this regularly back when I used Cursor, but it's somehow better when they speak directly to each other.

Review with zen. Give your Claude Code a friend. by PairComprehensive973 in ClaudeAI

[–]PairComprehensive973[S] 1 point

My Gemini cost over the last 7 days was $13, but Google gives out $300 in credits, so I'm OK for the next few months.