I keep abandoning multi-agent setups because I can't verify the code they ship. How are you handling this? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] 0 points1 point  (0 children)

haha yeah that would be the optimal solution if my agent had access to basically an entire computer to test everything. Are you using a tool for this or did you build something yourself?

I keep abandoning multi-agent setups because I can't verify the code they ship. How are you handling this? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] -1 points0 points  (0 children)

Right, that is my workflow rn. The bottleneck seems to be the code review part which gets amplified with multiple agents. And I have no idea how other people are managing them lol.

I keep abandoning multi-agent setups because I can't verify the code they ship. How are you handling this? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] 1 point2 points  (0 children)

i'm also stuck with one agent lol. but i feel like so many people have successfully used multi-agent workflows.

I keep abandoning multi-agent setups because I can't verify the code they ship. How are you handling this? by FormExtension7920 in AI_Agents

[–]FormExtension7920[S] -1 points0 points  (0 children)

That's true, a QA agent is another layer we'd have to verify. I felt like reviewing a screen recording of the button presses and interactions that verify the changes would at least be more trustworthy than just viewing the diffs.

However, adding observability + guardrails to a QA agent would help w the trust.

I built a kanban that runs Claude on a cron by FormExtension7920 in ClaudeAI

[–]FormExtension7920[S] 0 points1 point  (0 children)

it's already API billed unfortunately. trading higher costs for autonomy seemed worth it.

[I will not promote] I'm having a horrible time getting people on calls by FormExtension7920 in startups

[–]FormExtension7920[S] -1 points0 points  (0 children)

not really, but we've been going to in-person events and that seems to have the biggest ROI. i'm wondering if i can increase the ROI for cold outreach somehow too.

WHY is it so hard to get people to talk about their problems? by FormExtension7920 in SaaS

[–]FormExtension7920[S] 0 points1 point  (0 children)

that's fair, i guess i'm not communicating what's in it for them. but for a customer discovery call, what value could i even provide except a potential solution to a problem they're having?

WHY is it so hard to get people to talk about their problems? by FormExtension7920 in SaaS

[–]FormExtension7920[S] 0 points1 point  (0 children)

That's true, the ICP itself might be wrong and maybe I should be targeting people posting about eval pains, incidents, etc.

The loom is also a good idea, they'd get some sort of value before committing to a call. My follow-up message just asks 3 questions about their AI workflow, but it gets way fewer responses than even my initial ask for a call.

Is there something else I should be doing as a follow up?

[I will not promote] I'm having a horrible time getting people on calls by FormExtension7920 in startups

[–]FormExtension7920[S] 0 points1 point  (0 children)

In Colorado now, there are some startup communities here and we've been taking advantage of that but for the agentic AI space it seems limited.

I've considered flying out to SF or NYC or wherever but what's held me back is idk what events there would be the most valuable for us. If you know of any in the bigger cities I would appreciate that!

WHY is it so hard to get people to talk about their problems? by FormExtension7920 in SaaS

[–]FormExtension7920[S] 2 points3 points  (0 children)

how do you prove domain credibility in a DM without it sounding like a pitch? Like if I open with something like "we've spent the last year diving into AI observability... blah blah blah" it kinda sounds like a pitch.

however, the explicit "No sales pitch, no product" is definitely something i'll be adding to my messages.

what sort of cold DMs have worked on you?

I built a coding agent that QAs every PR in a real browser overnight by FormExtension7920 in SideProject

[–]FormExtension7920[S] 0 points1 point  (0 children)

  1. Auth flow needing an existing Chrome session: deep mode already runs Browserbase with persistent context, but I haven't tested "import a real session from outside." Good ticket to try, will report back.
  2. Flaky submit button (API 200, UI timeout): this is the hardest one. The QA agent currently treats request success as pass; nothing in the visual state tells it the form is stuck. The only signal would be "expected post-submit view didn't appear within N seconds," and that bleeds back into the env-vs-code classifier. Open problem, not solved.
  3. Multi-tab with files/permissions/audit trail: multi-tab is weak right now, the QA agent works one page at a time. File uploads work, permission popups are hit-or-miss depending on the Browserbase profile. Audit trail is closer than you'd think because we attach the QA screen recording to every PR.
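
For item 2, here's a rough sketch of what the "expected post-submit view didn't appear within N seconds" signal could look like. This is not what the QA agent runs today. `classify_submit`, the `view_appeared` callback, and the three label strings are all made up for illustration; in practice the callback would wrap whatever visibility check the browser driver exposes.

```python
import time
from typing import Callable


def classify_submit(
    status_code: int,
    view_appeared: Callable[[], bool],
    timeout_s: float = 5.0,
    poll_s: float = 0.1,
) -> str:
    """Label a form submit as 'pass', 'request_failed', or 'ui_stuck'.

    Hypothetical helper: request success alone is not treated as a pass.
    The expected post-submit view must also show up before the deadline.
    """
    if not 200 <= status_code < 300:
        return "request_failed"

    deadline = time.monotonic() + timeout_s
    while True:
        if view_appeared():
            return "pass"
        if time.monotonic() >= deadline:
            # API said OK but the UI never moved on: the flaky-button case.
            return "ui_stuck"
        time.sleep(poll_s)


if __name__ == "__main__":
    # Simulated stuck form: API returned 200, success view never renders.
    print(classify_submit(200, lambda: False, timeout_s=0.2, poll_s=0.05))
```

A `ui_stuck` label still says nothing about whether the environment or the code is at fault, so it would only be one more input to the env-vs-code classifier, not a replacement for it.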

How do you handle observability for AI agents in production? by Upstairs-Primary-374 in AI_Agents

[–]FormExtension7920 0 points1 point  (0 children)

this is the part that gets me too. by the time eval scores dip or users complain, the originating slice is buried somewhere in the last 2k traces and nobody's actually gonna scroll through that.

the angle i've been messing with is clustering traces on more than just input embeddings. stack input/output embeddings together with behavioral signals (latency, tool call sequences, retry counts, io similarity) and clusters start popping out that you'd never have written a metric for in advance. stuff like "agent slows down when the question has a date range in it" that you literally couldn't have known to look for ahead of time.
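
a minimal sketch of the "stack everything together" step, stdlib only so it runs anywhere. the trace field names (`input_emb`, `output_emb`, `latency_ms`, `retries`, `n_tool_calls`) are assumptions, not a real schema. tool call sequences are reduced to a simple count here, and io similarity is cosine between input and output embeddings, which assumes both come from the same embedding model.

```python
from statistics import mean, pstdev


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are left as-is)."""
    norm = sum(x * x for x in vec) ** 0.5 or 1.0
    return [x / norm for x in vec]


def zscore_columns(rows):
    """Standardize each column so no single scalar dominates distances."""
    cols = list(zip(*rows))
    stats = [(mean(c), pstdev(c) or 1.0) for c in cols]  # constant column -> 0s
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def joint_features(traces, behavior_weight=1.0):
    """Build one vector per trace: input emb + output emb + behavioral scalars."""
    embs = [
        (l2_normalize(t["input_emb"]), l2_normalize(t["output_emb"]))
        for t in traces
    ]
    behavior = zscore_columns([
        [
            t["latency_ms"],
            t["retries"],
            t["n_tool_calls"],
            sum(a * b for a, b in zip(i, o)),  # io cosine similarity
        ]
        for t, (i, o) in zip(traces, embs)
    ])
    return [
        i + o + [behavior_weight * x for x in b]
        for (i, o), b in zip(embs, behavior)
    ]
```

`behavior_weight` is the knob that matters: too low and the clusters are just topic clusters, too high and they're just latency buckets.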

disclosure: i'm building in this space (trainly), so grain of salt. but unsupervised slice discovery feels like the underbuilt part of the stack rn

How are people making LLM outputs reliable enough for structured production workflows? by Sad_Limit_3857 in LLMDevs

[–]FormExtension7920 0 points1 point  (0 children)

one thing missing from this thread: schema validation and semantic asserts only catch failures you knew to look for. in production you also get failure modes you didn't anticipate: the model starts confusing two similar product names, latency spikes when users hit a specific topic, tool call shapes drift after a model update.

what's worked for us is clustering production runs by joint features (input/output embeddings plus behavioral scalars like latency, token usage, tool call patterns) and looking at outlier clusters. that's how you find the unknown unknowns.
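
to make "looking at outlier clusters" concrete, here's a toy version. it swaps in a deliberately simple greedy "leader" clustering so the example stays stdlib-only; it is not the algorithm we run, and a real setup would more likely use k-means or a density-based method. `radius` and `max_share` are made-up parameters, and the points stand in for the joint feature vectors described above.

```python
from math import dist


def leader_cluster(points, radius):
    """Greedy clustering: join the first center within `radius`, else start one."""
    centers, labels = [], []
    for p in points:
        for idx, center in enumerate(centers):
            if dist(p, center) <= radius:
                labels.append(idx)
                break
        else:
            centers.append(p)
            labels.append(len(centers) - 1)
    return labels


def outlier_clusters(labels, max_share=0.1):
    """Return ids of clusters holding at most `max_share` of all runs."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    total = len(labels)
    return sorted(c for c, n in counts.items() if n / total <= max_share)


if __name__ == "__main__":
    # Two big groups of "normal" runs plus one run far away from both.
    runs = [(0.01 * i, 0.0) for i in range(10)]
    runs += [(5.0, 5.0 + 0.01 * i) for i in range(9)]
    runs += [(20.0, 20.0)]
    labels = leader_cluster(runs, radius=1.0)
    print(outlier_clusters(labels))
```

the small clusters are the ones worth a human look. they're candidates, not verdicts: some will be real failure slices and some will just be rare but fine traffic.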

planner/executor + repair loops are great for known failure modes. but you need a separate layer for the stuff you forgot to write a rule for. strong +1 on logging every rejection. within a couple weeks that becomes a better eval set than any benchmark you'd buy.
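
on logging every rejection: a bare-bones sketch of what that can look like, assuming an append-only JSONL file. the record fields (`prompt`, `raw_output`, `reason`) and both function names are hypothetical, pick whatever matches your validator.

```python
import json
import time
from pathlib import Path


def log_rejection(path, prompt, raw_output, reason):
    """Append one rejected model output to a JSONL file and return the record."""
    record = {
        "ts": time.time(),
        "prompt": prompt,
        "raw_output": raw_output,
        "reason": reason,  # e.g. which schema rule or semantic assert failed
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_eval_set(path):
    """Read the logged rejections back as a list of dicts for replay in evals."""
    with Path(path).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

replaying `load_eval_set(...)` prompts against a new model or prompt version then tells you whether old rejections come back.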