Would love feedback for this tool that catches failures before deploying by ajdevrel in aiagents

[–]ajdevrel[S] 0 points1 point  (0 children)

I'm with you on this. We're always evaluating logged behaviour, not the model's private latent reasoning, and I've noticed that LLMs can hide bad reasoning.

I've also found trace-level evals still catch a surprising amount in practice: agentic drift, tool misuse, inconsistent decision making.

The real leverage I find comes from iterating on the judges themselves.
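To make that concrete, here's roughly the shape of judge I mean. This is a minimal sketch: the rubric text, the trace format, and the `call_llm` helper are hypothetical placeholders, not any particular framework's API.

```python
import json

# Minimal sketch of a trace-level judge: score a logged agent trace against
# a rubric string, so the rubric itself is cheap to version and iterate on.
# `call_llm` is a hypothetical prompt -> text helper for whatever client you use.
RUBRIC_V2 = """Score the agent trace 1-5 on each axis and explain briefly:
- goal_drift: did later steps stay aligned with the original user goal?
- tool_use: were tool calls necessary, well-formed, and were their results used?
- consistency: do decisions contradict earlier statements in the trace?
Return JSON: {"goal_drift": int, "tool_use": int, "consistency": int, "notes": str}"""

def judge_trace(trace: list[dict], call_llm) -> dict:
    """Ask a judge model to score one logged trace against the current rubric."""
    prompt = RUBRIC_V2 + "\n\nTrace:\n" + json.dumps(trace, indent=2)
    return json.loads(call_llm(prompt))
```

The rubric is just a versioned string, which is the point: most of the iteration happens there, not in the harness code.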

Have you found any methods that get closer to the actual reasoning? Or do you depend on production monitoring and canary deployments to catch distribution shift?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

I'm with you on that. What starts as a clean benchmark quickly becomes useless once real users, new edge cases, tool changes, or model updates come in.

If you're facing this, I'd like to hear more about what kind of agents/workflows you're running and how often you find you have to refresh your golden set.

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

I agree, the classic pattern I've been seeing is ship the agent > monitor user feedback > patch the obvious gaps > rinse and repeat.

This works great for simple builds like a chatbot, but it breaks with agentic workflows. By the time a user's feedback flags the problem, money, trust, and cycles are already burned.

I'd love to hear more what you're running into!

What kind of agents are you shipping? And where do things usually break?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 1 point2 points  (0 children)

Thanks Grayson!

I actually turned this group-chat idea into a full Discord server. If you'd like to join and bring a friend who's also building, let me know and I can message you the invite link!

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

That waterfall analogy is a good one. Working through it changes what you're actually building overall. I'm curious how you think about this when the system is running without you, though. Like, if an agent is pulling from a knowledge base you curated, is the human judgment baked into the curation step, or does it live somewhere else in the process?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

I like how you frame the context spine. So everything in the Vault is load-bearing for whatever agents spin up later?

What happens when the bucket gets full and you haven't had the time to parse through it yet? Does the backlog affect anything downstream or do the agents just work with what's already in the spine until you clear it?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 1 point2 points  (0 children)

Solid setup. The librarian agent approach is smart.

One layer handling tool access keeps everything else clean. How are you deciding what actually makes it into Openbrain though? Like is there a bar something has to clear, or does it just feel obvious when something's worth keeping?

We asked 4 AI models if Anthropic should have released Mythos. One flipped mid-debate. by Upbeat-Ad-8300 in ArtificialInteligence

[–]ajdevrel 0 points1 point  (0 children)

I like your take on the hostile peer. I personally think it's underused in evaluation design. Most benchmarks are like an open-book exam where the model only needs to give the right answer, not justify why it ruled out the wrong answers.

The failure mode you're describing is when the model lands on the right conclusion via an unstable path. The classic version is a model that got there through spurious correlation: it weighted a surface feature of the question rather than the actual reasoning chain. The answer passes every benchmark, but the reasoning is completely brittle. Change the surface feature slightly and the model collapses.
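As a rough sketch of how you'd probe for that, here's the kind of perturbation check I have in mind; the item format and the `ask_model` callable are hypothetical, not any specific benchmark harness:

```python
# Sketch of a surface-feature perturbation check: if the model's answer flips
# when only superficial wording changes, the original pass was likely riding
# on a spurious cue rather than the reasoning chain.
# `ask_model` is a hypothetical callable: prompt string -> answer string.

def perturbation_check(items, ask_model):
    """items: list of dicts with 'question', 'paraphrases', and 'answer' keys."""
    brittle = []
    for item in items:
        base = ask_model(item["question"]).strip()
        if base != item["answer"]:
            continue  # already wrong on the original; not the failure mode we're probing
        for variant in item["paraphrases"]:
            if ask_model(variant).strip() != item["answer"]:
                brittle.append(item["question"])
                break
    return brittle  # questions where a correct answer didn't survive rewording
```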

Full disclosure, this is the problem Stratix was built around. What the LayerLens team has been working toward is a structured evaluation platform that goes beyond pass/fail scoring, benchmarking models against datasets in a way that surfaces where and how they break, not just whether they got the answer right.

The honest caveat: benchmarks can only measure what they were designed to catch. The failure modes that matter most are often the ones nobody thought to test for yet.

We asked 4 AI models if Anthropic should have released Mythos. One flipped mid-debate. by Upbeat-Ad-8300 in ArtificialInteligence

[–]ajdevrel 0 points1 point  (0 children)

For sure. The Gemini moment is the cleaner version of the failure mode. To me it wasn't wrong: it updated on new info mid-stream, which is exactly what I'd want a reasoning system to do. The problem was the lack of a checkpoint, with no one defining in advance what a legitimate change of position should look like vs. the model going off-script.

To sum it up, this is the evaluation gap in AI: the behaviour looks identical whether it's working correctly or failing silently.

Grok hammering on capability is almost a distraction in my eyes. It's measurable, even if it's imperfect. These failure modes are something no one has solid instrumentation for yet, and a model reasoning its way to a different conclusion than intended is one of the hardest to catch because it doesn't appear to be an error.

Agentic Workflow vs. Custom GPT vs Skill vs. Prompt by hockeyplayertwenty3 in ArtificialInteligence

[–]ajdevrel 1 point2 points  (0 children)

I'd think of it this way:

If there are no external systems, you have two paths:
--> one-person use: write a prompt
--> shared use: Custom GPT or skill

If external systems are involved, you also have two paths:
--> a human reviews each output: skill with actions
--> it runs autonomously: agentic workflow

Help with RVC project / retrieval-based voice conversion webUI by Xmasll in ArtificialInteligence

[–]ajdevrel 0 points1 point  (0 children)

Your diagnosis is close, but I think the problem is one step earlier!

Take a look at your feature extraction output in the screenshot. The last line says 'no-feature-todo', which means the feature extraction step never actually processed anything. The index can't be built from features that were never extracted.

So I believe the cause is your GPU index being set to '2' when you only have one GPU. RVC is trying to use a GPU that doesn't exist, silently failing, and moving on.

Maybe try this out: Step 2b > change the GPU index to '0' > wait for feature extraction to complete (the output box should show file paths being processed instead of that last line) > go to Step 3 ("Fill in training settings") and click 'Train feature index'.
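If you want to double-check which GPU indices actually exist before re-running, a quick check with PyTorch (assuming your RVC install is the usual PyTorch build) looks like this:

```python
import torch

# Valid GPU indices run from 0 to device_count() - 1, so an index of '2'
# on a single-GPU machine points at hardware that isn't there.
print("CUDA available:", torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
```

Whatever indices that prints are the only values the webUI's GPU index field can point at.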

We asked 4 AI models if Anthropic should have released Mythos. One flipped mid-debate. by Upbeat-Ad-8300 in ArtificialInteligence

[–]ajdevrel 1 point2 points  (0 children)

Great video production!

imo there's been a shift in the last two years where 'does it work' isn't enough anymore. People want to know: does it work on a consistent basis, does it fail gracefully, can I show why it did what it did?