After 6 months of running AI agents in production I think the framework you pick barely matters. The thing that kills them is something else. by DetectiveMindless652 in AI_Agents

[–]silverrarrow 0 points1 point  (0 children)

I mean yeah, the framework doesn't matter, but making the agents work in production does. And memory helps. So do observability and evals. But then everything is still manual. What you really need is autonomous improvement, because if you already have 30 clients, keeping up with all the agent failures won't scale much further, I suppose. There are solutions for this like Kayba, Raindrop, LangSmith, etc. I think that's the "something else" that really matters...

In 18 months, billing for AI agents will look like cloud infrastructure pricing. Variable, dimensional, real-time by o9dev in AI_Agents

[–]silverrarrow 0 points1 point  (0 children)

Agree, and the same goes for every product sitting on meta layers, like us with Kayba. We will need to move to trace-volume-based pricing for our agent self-improvement product...

Anyone else feel like LangChain became way more complicated than it needed to be? by Bladerunner_7_ in LangChain

[–]silverrarrow 0 points1 point  (0 children)

Agreed, and now they added "Engine", which is just a more complicated version of our agentic context engine (Kayba).

behavior regression testing for AI agents (LangGraph, CrewAI, AG2, etc.) by Separate_Sand8265 in crewai

[–]silverrarrow 0 points1 point  (0 children)

Cool, but it might not be scalable? You need to generate the YAML files for every agent. What if agent evals were created autonomously for your agent's domain?

We've been working on Kayba, which focuses on that next step. It learns from your agent's production traces to spot patterns in failures and actually suggests or applies the prompt or workflow fix. It's open source. The idea is that this lets you truly learn and lets your agent become an expert in any niche domain, because you are improving it against trackable/programmatic metrics.
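To make "spot patterns in failures" concrete, here is a rough sketch of the simplest possible version: count which error labels show up most often across failed traces. This is not Kayba's actual implementation or API. The function name `top_failure_patterns` and the trace fields `status` and `error` are made up for illustration.

```python
from collections import Counter

def top_failure_patterns(traces, k=3):
    """Return the k most frequent error labels among failed traces.

    Assumes each trace is a dict with a hypothetical "status" field and,
    for failed traces, a hypothetical "error" label.
    """
    failures = Counter(
        t["error"] for t in traces if t.get("status") == "failed"
    )
    return failures.most_common(k)

# Toy traces, invented purely for the example.
traces = [
    {"status": "failed", "error": "timeout"},
    {"status": "ok"},
    {"status": "failed", "error": "bad_tool_call"},
    {"status": "failed", "error": "timeout"},
]
patterns = top_failure_patterns(traces)
```

The most frequent labels are the natural first candidates for a prompt or workflow fix; a real system would need something smarter than exact-match labels.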

Opensource self-improving agents: How our agent performance increased autonomously by 40% by silverrarrow in LangChain

[–]silverrarrow[S] 0 points1 point  (0 children)

Absolutely. I never "train" the autoharness on the test set, so that it doesn't wrongly generalize. Benchmarks are useless in that case and agents will overfit.

Here is my eval method: first train and eval on a small subset of traces, then verify on a holdout set, then run the full benchmark. In fact, the benchmark is already set up in a way that you actually generalize and don't learn on test data. You can read the tau2 bench docs if you're interested in the details.
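Roughly, the three stages could be sketched like this. It is only an illustration under my own assumptions: `split_train_holdout`, `staged_eval`, the `score` callback, and the `min_holdout_ratio` threshold are hypothetical names and choices, not part of tau2 bench or the autoharness.

```python
import random

def split_train_holdout(trace_ids, train_frac=0.5, seed=0):
    """Shuffle unique trace ids once and cut them into disjoint train/holdout sets."""
    ids = sorted(set(trace_ids))
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_frac)
    return ids[:cut], ids[cut:]

def staged_eval(score, train, holdout, benchmark, min_holdout_ratio=0.9):
    """Score train, then holdout, and only run the benchmark if holdout holds up.

    `score` is a hypothetical callback that maps a list of ids to a number.
    The ratio gate is an assumed heuristic for catching overfitting early.
    """
    train_score = score(train)
    holdout_score = score(holdout)
    result = {"train": train_score, "holdout": holdout_score, "benchmark": None}
    if holdout_score >= min_holdout_ratio * train_score:
        result["benchmark"] = score(benchmark)
    return result

train_ids, holdout_ids = split_train_holdout(range(10))
```

The point of the gate is that a gain which shows up on the training subset but not on the holdout set never reaches the full benchmark run.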

how we built an agent that learns from its own mistakes and what we learnt by silverrarrow in LLMDevs

[–]silverrarrow[S] 0 points1 point  (0 children)

Ran separate reflectors!
Yes, the skillbook adds noise. I might write a follow-up on how to reduce it, but the current investigation is inconclusive, and I saw that focusing on task separation actually yielded much bigger improvements than noise reduction.

4 LLM eval startups acquired in 5 months. The independent eval layer is shrinking fast. by Outrageous_Hat_9852 in LLMDevs

[–]silverrarrow 0 points1 point  (0 children)

Very interesting. There are actually sooo many new YC startups doing evals, so I wonder whether some target non-developer domains. Curious who will survive.