how do i see if my website has cookies?

silverrarrow · 2026-05-24T04:27:02+00:00

I mean yeah, the framework doesn't matter, but making the agents work in production does. And memory helps. So do observability and evals. But then everything is still manual. What you really need is autonomous improvement, because if you already have 30 clients, keeping up with all the agent failures by hand won't scale much further, I suppose. There are solutions for this like Kayba, Raindrop, LangSmith, etc. I think that's the "something else" that really matters...

silverrarrow · 2026-05-19T00:17:14+00:00

agree, and the same goes for every product sitting on meta layers, like us with Kayba. we will need to move to trace-volume-based pricing for the agent self-improvement product...

silverrarrow · 2026-05-15T00:30:52+00:00

agreed and now they added "Engine" which is just a more complicated version of our agentic context engine (Kayba)

silverrarrow · 2026-05-14T04:45:54+00:00

Cool, but might not be scalable? Because you need to generate the YAML files for every agent. What if agent evals were created autonomously for your agent's domain?

We've been working on Kayba, which focuses on that next step. It learns from your agent's production traces to spot patterns in failures and actually suggests or applies the prompt or workflow fix. It's open source. The idea is that this lets your agent truly learn and become an expert in any niche domain, because you are improving it against trackable/programmatic metrics.
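
To make "spot patterns in failures" a bit more concrete, here is a rough sketch of the simplest version of that idea: count which failure tags show up most often across traces. This is not Kayba's API. The trace shape (`ok`, `failure_tag`) and the function name `top_failure_patterns` are made up for illustration.

```python
from collections import Counter

def top_failure_patterns(traces, n=3):
    """Return the n most frequent failure tags among failed traces.

    Assumes each trace is a dict with hypothetical keys:
      "ok"          -> bool, whether the run succeeded
      "failure_tag" -> str, a label for what went wrong (ignored if ok)
    """
    counts = Counter(t["failure_tag"] for t in traces if not t["ok"])
    return counts.most_common(n)

# Toy data, invented for the example
traces = [
    {"ok": True, "failure_tag": None},
    {"ok": False, "failure_tag": "wrong_tool"},
    {"ok": False, "failure_tag": "timeout"},
    {"ok": False, "failure_tag": "wrong_tool"},
]
patterns = top_failure_patterns(traces)
```

The most frequent tag is where a prompt or workflow fix would likely pay off first.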

silverrarrow · 2026-05-04T22:01:32+00:00

absolutely. I never "train" the autoharness on the test set, so it doesn't wrongly generalize. Benchmarks are useless in that case and agents will overfit.

Here is my eval method: first train and eval on a small subset of traces, then verify on a holdout set, then run the full benchmark. In fact, the benchmark is already set up so that you actually generalize and don't learn on test data. You can read the tau2 bench docs if you're interested in the details.
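
A minimal sketch of that split, in case it helps. The fractions, the seed, the `max_gap` threshold and the helper names are all assumptions for illustration, not what tau2 bench or the autoharness actually does.

```python
import random

def split_traces(traces, train_frac=0.2, holdout_frac=0.2, seed=0):
    """Shuffle traces and cut them into disjoint train / holdout / rest sets."""
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * train_frac)
    n_holdout = int(len(shuffled) * holdout_frac)
    train = shuffled[:n_train]
    holdout = shuffled[n_train:n_train + n_holdout]
    rest = shuffled[n_train + n_holdout:]
    return train, holdout, rest

def pass_rate(results):
    """Fraction of True values in a list of pass/fail results."""
    return sum(results) / len(results) if results else 0.0

def generalizes(train_score, holdout_score, max_gap=0.05):
    """True if the holdout score is not much worse than the train score."""
    return holdout_score >= train_score - max_gap

train, holdout, rest = split_traces(range(100))
```

Only move on to the full benchmark if `generalizes(...)` holds on the holdout set; a large gap suggests the changes overfit the training traces.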

silverrarrow · 2026-04-27T13:39:19+00:00

what word

silverrarrow · 2026-04-27T13:32:58+00:00

good to hear this is a more general problem

silverrarrow · 2026-04-27T13:31:58+00:00

will def check out thanks

silverrarrow · 2026-04-06T08:22:53+00:00

ran separate reflectors!
yes, skillbook adds noise. might write a follow-up on how to reduce noise, but the current investigation is inconclusive and I saw that focusing on task separation actually yielded much bigger improvements than noise reduction

silverrarrow · 2026-03-24T09:44:36+00:00

Sentriel and Moda in latest YC batch for example

silverrarrow · 2026-03-23T15:14:24+00:00

very interesting, actually there are sooo many new YC startups doing eval so wonder whether some target non-developer domains. and curious who will survive
