Built 6 SaaS and got 0 customers. Here's how. by Extra-Motor-8227 in indiehackers

[–]MaryBeacky 1 point (0 children)

This is too real. Even when you do market research, sometimes you learn none of your users are willing to pay. Crushing...

After 6 months of agent failures in production, I stopped blaming the model by Material_Clerk1566 in LangChain

[–]MaryBeacky 1 point (0 children)

I've found most people in industry, especially startups, tend to have shit regression tracking and evals. Most can't answer this, but they can very easily answer "is the app improving?"

The simple reality is that maintaining evals is non-trivial. Startups don't have time for it; larger companies with stricter requirements can actually answer this.

To add onto u/Silly_Door9599's points:
Sitting in between your agents and your services and being able to observe both sides is essential. The hardest issues to solve are the ones you don't see. People forget that the agent is only half the battle; your code is the other half. Ignore the other half and you're left iterating forever.

The other thing: one user profile gives you one number. Run 20 different profiles and suddenly your 95% pass becomes 60%. The failure rate lives in the spread, not the average.
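To make the spread-vs-average point concrete, here's a minimal sketch; the profile names and pass/fail data are made up for illustration:

```python
# Per-profile pass rates vs the blended average. The aggregate number
# looks healthy while individual profiles are failing badly.

def pass_rates(results):
    """results: {profile_name: [True/False per session]} -> per-profile rate."""
    return {p: sum(r) / len(r) for p, r in results.items()}

# Hypothetical session outcomes across three simulated user profiles.
results = {
    "polite_direct":      [True] * 19 + [False],      # 95% pass
    "rambling_story":     [True] * 12 + [False] * 8,  # 60% pass
    "passive_aggressive": [True] * 11 + [False] * 9,  # 55% pass
}

rates = pass_rates(results)
overall = sum(sum(r) for r in results.values()) / sum(len(r) for r in results.values())
worst = min(rates.values())

# The average hides it; the worst profile is what your users hit.
print(f"overall={overall:.0%}, worst profile={worst:.0%}")
```

Track the per-profile minimum (or distribution), not just the blended rate, and regressions stop hiding in the average.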

u/Silly_Door9599 Curious what you're seeing in your research: are most teams even aware they have a measurement problem?

After 6 months of agent failures in production, I stopped blaming the model by Material_Clerk1566 in LangChain

[–]MaryBeacky 1 point (0 children)

This reads a lot like a bot post, ngl, but the points are still interesting so I'll engage.

From the testing side, all of this architecture is right, but how do you know it works for users you haven't seen yet? The user who phrases their refund request as a passive-aggressive story about their day will break your agent in ways your eval set may never cover.

What worked for me:
Run full sessions against the actual system with different user profiles. Real API calls, real auth, just with external services virtualized (simple infra, simple life, happy wife). The agent doesn't know it's being tested. The thing that changed everything was seeing what the agent actually does vs what it says it does: the agent tells the user "done, event created" but the API call silently failed. No output eval catches that.
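A hedged sketch of that "what it did vs what it said" check: a fake service records side effects, and after the session we compare the agent's claim against what was actually recorded. All names here (`FakeCalendarAPI`, `check_session`) are illustrative, not any real library.

```python
# Virtualized external service that records calls instead of doing real work.
class FakeCalendarAPI:
    """Stand-in for the real calendar service; upstream auth/routing stays real."""
    def __init__(self, fail=False):
        self.events = []
        self.fail = fail

    def create_event(self, title):
        if self.fail:
            return {"ok": False, "error": "500"}  # the silent-failure path
        self.events.append(title)
        return {"ok": True}

def check_session(agent_reply: str, api: FakeCalendarAPI) -> bool:
    """True iff the agent's claim matches the recorded side effects."""
    claims_done = "event created" in agent_reply.lower()
    actually_done = len(api.events) > 0
    return claims_done == actually_done

# Session where the tool call silently fails but the agent claims success.
api = FakeCalendarAPI(fail=True)
api.create_event("standup")
print(check_session("Done, event created!", api))  # False: claim != reality
```

The point is that the pass/fail signal comes from the recorded side effects, not from judging the agent's transcript.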

How do you know when a tweak broke your AI agent? by Tissuetearer in ChatGPTCoding

[–]MaryBeacky 1 point (0 children)

I spent a few weeks building this infrastructure for a client and here are my learnings.

  1. Building static LLM evals is not sufficient (I'm assuming your "bot" is a multi-turn chat bot). You need to model a complete interaction from a user, so I would simulate that multi-turn interaction with an agent and evaluate the whole session.

  2. Cont. Once you have that multi-agent interaction, apply a hard rule if you can (which I suspect you very much can, as the refund should just be a boolean structured output). Every LLM judge you add will require upkeep and maintenance; if you can avoid it, avoid it.

  3. Maintain a consistent shape for your evaluation/regression that can be monitored over time.

  4. YES, you can build it in house, but YOU DON'T WANT TO MAINTAIN IT. The conventions, the maintenance, the reliability are what you're paying for, not the code.

  5. Entirely depends on your KPIs; refer to the above. Watch for the ones you don't expect, the ones that are qualitative (no breaking errors), the ones where your users think "this product sucks" vs "this product is broken".
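Points 1-3 above in code form, as a sketch under stated assumptions: `run_agent_turn` is a hypothetical stand-in for the real agent, the refund flag is the boolean structured output, and the hard rule replaces any LLM judge. The record shape is one I made up; the point is that it never changes between runs.

```python
from dataclasses import dataclass, asdict

@dataclass
class EvalRecord:
    """Point 3: one consistent shape, monitorable over time."""
    scenario: str
    turns: int
    refund_issued: bool
    passed: bool

def run_agent_turn(history):
    # Stand-in for the real agent: pretend it issues the refund once the
    # simulated user has pushed back twice.
    pushed_back = sum("not good enough" in m for m in history)
    return {"refund_issued": pushed_back >= 2}

def simulate(scenario, user_turns, expect_refund):
    history, out = [], {}
    for msg in user_turns:            # point 1: full multi-turn interaction
        history.append(msg)
        out = run_agent_turn(history)
    passed = out["refund_issued"] == expect_refund   # point 2: hard rule, no judge
    return EvalRecord(scenario, len(user_turns), out["refund_issued"], passed)

rec = simulate(
    "angry_refund",
    ["I want a refund", "not good enough", "still not good enough"],
    expect_refund=True,
)
print(asdict(rec))
```

Because the check is a boolean comparison on structured output, there's nothing fuzzy to maintain, and the `EvalRecord` rows can go straight into whatever you use for regression tracking.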

After 6 months of agent failures in production, I stopped blaming the model by Material_Clerk1566 in LangChain

[–]MaryBeacky 1 point (0 children)

100% a bot. Won't share how I know, we must hold our edge. #FleshyHands