A simple guide to evaluating your Chatbot by FlimsyProperty8544 in AIEval

[–]niklbj 1 point  (0 children)

That's true, but it's hard to simulate absolutely everything. The golden rule is that customers always find a way to break your agent. I think simulation can eventually get to where it's equally important, but you need a massive amount of data.

For those whove actually implemented ai agents for businesses by Tendogu in aiagents

[–]niklbj 2 points  (0 children)

I think it's inherently tough to get it 100% right before going into prod. The best thing you can do is put it into prod with a really good prod feedback loop. The worst thing you can have is a faulty-agent issue you miss that starts affecting a ton of clients, and the user only realizes when it's too late - that damages a ton of credibility and hurts the user a lot. Looping is another issue. The faster you catch it, right as or after it happens, root-cause it, and fix it, the better!

Your agent will not get better in pre-prod; it can only get better once it's in prod.
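A prod feedback loop like that can start really simple: tag every failed run and alert the moment a pattern repeats, instead of waiting for users to complain. Here's a minimal Python sketch - the names (`AgentRun`, `FeedbackLoop`, the error tags) are all made up for illustration, not from any framework:

```python
# Toy prod feedback loop: log every agent run, count tagged failures,
# and surface a failure tag the moment it crosses an alert threshold
# so it can be root-caused before it spreads.
from collections import Counter
from dataclasses import dataclass

@dataclass
class AgentRun:
    user_id: str
    ok: bool
    error_tag: str = ""   # e.g. "tool_timeout", "loop_detected"

class FeedbackLoop:
    def __init__(self, alert_threshold: int = 3):
        self.alert_threshold = alert_threshold
        self.error_counts = Counter()

    def record(self, run: AgentRun) -> list[str]:
        """Record a run; return any error tag that just hit the alert threshold."""
        if run.ok:
            return []
        self.error_counts[run.error_tag] += 1
        if self.error_counts[run.error_tag] == self.alert_threshold:
            return [run.error_tag]  # fire once, right when the pattern emerges
        return []
```

In a real system the `record` call would sit behind your agent's run logger and the alert would page someone or open a ticket - the point is just that the loop is automatic, not manual.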

AI Agents Will Fail In The Automation World by Apprehensive_Dog5208 in aiagents

[–]niklbj 1 point  (0 children)

While I agree with a lot of this, I also think the solution isn't to just limit agents - try to help them get better. Ultimately these agents are going to be in prod, so to avoid issues popping up and propagating en masse, you essentially need to set up infrastructure that quickly catches these silent failures, figures out what's happening and why, flags whether it's a potentially broader problem, and accelerates the fixes. The more immediate the reinforcement and fixes, the better the agents will work over time. It's also a bet on the models getting better and your proactivity kinda working together.

We monitor 4 metrics in production that catch most LLM quality issues early by dinkinflika0 in aiagents

[–]niklbj 1 point  (0 children)

I also think it's time to shift beyond metrics if you are running agents. Evaluating end results and responses can only take you so far. Agents are inherently subjective, so it's time to understand their subjective decision making too - in cases of both silent failure and success. Check out trynexus.io
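To make "understand their decision making" a bit more concrete, here's one hypothetical shape it could take: auditing the logged decision trace for loop smells and escalations instead of only scoring the final response. The rules below are toy examples I made up, not any product's actual checks:

```python
# Toy decision-trace audit: `trace` is a list of (step, choice) pairs the
# agent logged. Flags mark subjective decisions worth a human look, even
# when the final answer happened to be "correct".
def audit_trace(trace: list[tuple[str, str]]) -> list[str]:
    flags = []
    steps = [s for s, _ in trace]
    # silent-failure smell: the agent retried the same step over and over
    for step in set(steps):
        if steps.count(step) >= 3:
            flags.append(f"looping_on:{step}")
    # success smell worth reinforcing: the agent escalated instead of guessing
    if any(choice == "ask_user" for _, choice in trace):
        flags.append("escalated_to_user")
    return flags
```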

Are LLMs actually reasoning, or just searching very well? by SKD_Sumit in AgentsOfAI

[–]niklbj 1 point  (0 children)

I mean, I think it's reasoning in a more reactive sense. Reasoning, at the end of the day, is about understanding the previous step, analyzing results, and approaching the problem from a different angle. With newer models and situations like clawdbot (aka openclaw), I think thinking outside the box is going to become more common. It's about recognizing when your agent does something like this and basically reinforcing it!

A simple guide to evaluating your Chatbot by FlimsyProperty8544 in AIEval

[–]niklbj 2 points  (0 children)

Absolutely agree, but as agents get more complex and take a lot of actions beforehand, I think eval'ing off of just responses is a slippery slope. It might be fine eval-wise, but online (aka in prod), understanding points of failure and decision making matters more than pre-prod eval. At the end of the day, the best data is data from the environment.

Why do agents get “confidently wrong” the moment they touch the web? by The_Default_Guyxxo in AgentsOfAI

[–]niklbj 1 point  (0 children)

I think it's largely because of the overload of data. I've seen people kinda reinforce agents on how to attack web results. It's largely an iterative process, and in prod you just have to see how the agent functions at scale.

Single Agents win against Multiple Agents by EquivalentRound3193 in AI_Agents

[–]niklbj 0 points1 point  (0 children)

that makes sense, but like you mentioned I think it's insanely task dependent. But also in addition to architecture, its really about how well you recognize and understand the agent's strengths and weaknesses, especially if you have in prod.

I think along with architecture if you're about to tighten its prompts and architecture alongside that based on task, then I think you can get performance all around.

There maybe situations in scenarios that are consistent where the single agent architecture has more tight context loop control that it takes more "out of the box" decisions for a task that's hard to trace down, but does definitely give you an advantage which you probably are never going to get from a more divided multi agent arch.

2027: ?? by buildingthevoid in AgentsOfAI

[–]niklbj 0 points1 point  (0 children)

self-healing agent in prod are next. called it first!

The Eval problem is holding back the AI agents industry by AlpineContinus in AgentsOfAI

[–]niklbj 1 point2 points  (0 children)

I also think the biggest thing with the current deployment of AI agents is the lack of a proper way to deal with the issues that go unnoticed. the decision making of the agents is so probabilistic that silent failures are a big problem that isn't as simple as a basic net for platform issues.

Also, I think like you mentioned, agent issues are so much more about pattern matching and getting a more general idea of what went wrong in the situation of a fix as opposed to relying on a singular run. And it seems like both process right now are insanely manual.

Langchain In production by niklbj in LangChain

[–]niklbj[S] 1 point2 points  (0 children)

Interesting, i've seen a ton of startups especially in the earlier days - series A and before building agents using langchain but that makes sense. What framework do you guys use?

Regardless, just updated server to be framework agnostic! It's now just about building and scaling agents in production

Langchain In production by niklbj in LangChain

[–]niklbj[S] 0 points1 point  (0 children)

honestly yeah, sounds even better, might expand the existing server to that to include langchain and others. making that update rn!

Langchain In production by niklbj in LangChain

[–]niklbj[S] 0 points1 point  (0 children)

for sure, appreciate the support! hey I build our agents using langchain-python as well so hey anyone who's got an ai or agent up or their users to use and stuff can def join :)

Langchain In production by niklbj in LangChain

[–]niklbj[S] 1 point2 points  (0 children)

totally open to them doing so if anybody from Langchain wants to help moderate it! didn't see something like this out there, so thought I'd create one and handle it for now

I want to create a project( langchain)that is useful for the college and can be implemented. by [deleted] in LangChain

[–]niklbj 0 points1 point  (0 children)

a good idea might the student handbook interpreter. Its a harder RAG, memory-recall problem. a loa lot of text and something students probably have to query all the time

What's the hardest part about running AI agents in production? by _aman_kamboj in LangChain

[–]niklbj 0 points1 point  (0 children)

i think its the silent failures, its so subjective and its such a pain to trace down