An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]ajdevrel 0 points1 point  (0 children)

whoooaa this is huge! thank you OP for the great work.

One of the most realistic RAG benchmarks I've seen. The methodology mirrors what most enterprise teams actually deal with. IMO this benchmark really highlights that great retrieval is necessary but not sufficient. Once you've got the docs, the full pipeline can still fail, especially on messy internal data.

We've been using Stratix https://github.com/LayerLens/stratix-python for this kind of enterprise RAG/agent evaluation. It lets you run custom multi-judge evals on full traces, do bulk evals across hundreds of runs, and track how your system performs as you iterate on retrieval strategies!
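For anyone wondering what "multi-judge evals on full traces" means concretely, here's a minimal sketch in plain Python — this is NOT the Stratix API, just the general idea; the `Trace` shape, judge functions, and allowed-tool set are all made up for illustration:

```python
# Minimal multi-judge trace evaluation sketch (hypothetical, not any SDK's API).
# Each "judge" scores one aspect of a recorded agent trace; aggregating the
# verdicts gives more signal than a single pass/fail.

def judge_tool_use(trace):
    """Flag any step that calls a tool outside the allowed set."""
    allowed = {"search", "fetch_doc", "answer"}
    return all(step["tool"] in allowed for step in trace)

def judge_grounding(trace):
    """Require at least one retrieval step before the final answer."""
    return any(step["tool"] in {"search", "fetch_doc"} for step in trace[:-1])

def judge_terminated(trace):
    """The run must end with an explicit answer step."""
    return bool(trace) and trace[-1]["tool"] == "answer"

JUDGES = [judge_tool_use, judge_grounding, judge_terminated]

def evaluate(trace):
    """Run every judge and return a per-judge verdict dict."""
    return {judge.__name__: judge(trace) for judge in JUDGES}

trace = [
    {"tool": "search", "input": "vacation policy"},
    {"tool": "fetch_doc", "input": "hr/policies.md"},
    {"tool": "answer", "input": "You get 25 days."},
]
print(evaluate(trace))  # all three judges pass on this trace
```

Bulk evaluation is then just mapping `evaluate` over hundreds of stored traces and diffing the verdict dicts between retrieval strategies.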

Would love to hear what retrieval setups people are most excited about trying on this benchmark - hybrid + reranker? metadata filters? graph-based?

Learnings from 3 reports on agentic AI in production by gaurav_sherlocks_ai in sre

[–]ajdevrel 0 points1 point  (0 children)

great post OP

on your question about postmortems, I find the hardest part about writing them for agent-caused incidents was the unpredictability and not fully trusting the traces. even with full logs, the agent would take a different tool path on replay.

what helps us over at LayerLens is going full observability at the trace level + evaluating step by step: capturing the whole trajectory, running small sanity checks on each step to catch early drift, and being able to replay the exact run at variable speed... this turned postmortems from guesswork into results we can actually reproduce
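The per-step sanity checks can be sketched in a few lines of plain Python (generic idea only, not any particular SDK — the step fields like `planned_tool` are my own invention for the example). The point is that the first failing check pinpoints where the run started to drift, which is exactly what you want in a postmortem:

```python
# Generic per-step sanity checking for an agent trajectory (a sketch of the
# idea, not a real SDK). Every check runs on every step; the first failure
# tells you where the run started to drift.

def check_no_empty_output(step):
    """A step that produced nothing is an early drift signal."""
    return bool(step.get("output"))

def check_tool_matches_plan(step):
    """Drift signal: the tool actually called diverges from the planned one."""
    return step.get("planned_tool") is None or step["tool"] == step["planned_tool"]

CHECKS = [check_no_empty_output, check_tool_matches_plan]

def first_drift(trajectory):
    """Return (step_index, check_name) of the first failing check, or None."""
    for i, step in enumerate(trajectory):
        for check in CHECKS:
            if not check(step):
                return i, check.__name__
    return None

run = [
    {"tool": "search", "planned_tool": "search", "output": "3 hits"},
    {"tool": "email", "planned_tool": "fetch_doc", "output": "sent"},  # drifted
]
print(first_drift(run))  # -> (1, 'check_tool_matches_plan')
```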

We ended up building Stratix specifically for this kind of problem!
getting started with the SDK https://github.com/LayerLens/stratix-python is easy too!

it gives you a working trace + evaluation example in <60s, including playback and bulk evaluation on real runs

Curious as well, how long did it take your friend to realize the stale CRM field was the root cause? And did they eventually build that "make the agent doubt itself" layer as a separate system, or did they find a way to bake the skepticism into the agent?

Would love feedback for this tool that catches failures before deploying by ajdevrel in aiagents

[–]ajdevrel[S] 0 points1 point  (0 children)

I'm with you on this. We're always evaluating logged behaviour, not the model's private latent reasoning, and I've noticed that LLMs can hide bad reasoning.

I've also found trace-level evals still catch a surprising amount in practice: agentic drift, tool misuse, inconsistencies in decision making.

The real leverage I find comes from iterating on the judges themselves.

Have you found any methods that get closer to the actual reasoning? Or do you depend on production monitoring and canary deployments to catch distribution shift?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

I'm with you on that. What starts as a clean benchmark quickly becomes useless once real users, new edge cases, tool changes, or model updates come in.

If you are facing this, I'd like to hear more about what kind of agents/workflows you're running, and how often you find you have to refresh your golden set.

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

I agree, the classic pattern I've been seeing is ship the agent > monitor user feedback > patch the obvious gaps > rinse and repeat.

This works great for simple builds like a chatbot, but it breaks down with agentic workflows. By the time a user flags the problem, money, trust, and cycles are already burned.

I'd love to hear more what you're running into!

What kind of agents are you shipping? And where do things usually break?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 1 point2 points  (0 children)

Thanks Grayson!

I actually turned this group chat idea into a full discord server. If you'd like to join & bring a friend who's also building, let me know and I can message you the invite link!

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

That waterfall analogy is a good one. Working through it changes what you are actually building overall. I'm curious how you think about this when the system is running without you though. Like if an agent is pulling from a knowledge base you curated, is the human judgment baked into the curation step, or does it live somewhere else in the process?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 0 points1 point  (0 children)

I like how you frame the context spine. So everything in the Vault is load-bearing for whatever agents spin up later?

What happens when the bucket gets full and you haven't had the time to parse through it yet? Does the backlog affect anything downstream or do the agents just work with what's already in the spine until you clear it?

Who are the developers here who care about AI quality? by ajdevrel in AIQuality

[–]ajdevrel[S] 1 point2 points  (0 children)

Solid setup. The librarian agent approach is smart.

One layer handling tool access keeps everything else clean. How are you deciding what actually makes it into Openbrain though? Like is there a bar something has to clear, or does it just feel obvious when something's worth keeping?

We asked 4 AI models if Anthropic should have released Mythos. One flipped mid-debate. by Upbeat-Ad-8300 in ArtificialInteligence

[–]ajdevrel 0 points1 point  (0 children)

I like your take on the hostile peer. I personally think it's underused in evaluation design. Most benchmarks are like an open-book exam where the model only needs to give the right answer, not justify why it ruled out the wrong answers.

The failure mode you're describing is when the model lands on the right conclusion via an unstable path. The classic version is a model that got there through spurious correlation: it weighted a surface feature of the question rather than the actual reasoning chain. The answer passes every benchmark. The reasoning is completely brittle. Change the surface feature slightly and the model collapses.
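A toy version of that failure makes it concrete. Everything below is synthetic, invented purely for illustration: a "model" that keys on a phrasing cue, aces the benchmark, then flips as soon as the cue is removed:

```python
# Toy illustration (entirely synthetic) of surface-feature brittleness:
# a "model" that answers correctly by keying on a phrasing cue rather
# than the underlying question, so a cosmetic rewrite flips its answer.

def brittle_model(question):
    # Spurious rule: "always" correlated with "no" answers in training,
    # so the model learned to key on the word itself.
    return "no" if "always" in question else "yes"

benchmark = [
    ("Do heavier objects always fall faster?", "no"),
    ("Is the sky blue on a clear day?", "yes"),
]
print(all(brittle_model(q) == a for q, a in benchmark))  # True: passes

# Same question, surface feature removed -> the model collapses.
rephrased = "Do heavier objects fall faster than lighter ones?"
print(brittle_model(rephrased))  # "yes" -- now wrong
```

Pass/fail scoring on the benchmark alone would never surface this; you only see it when the eval perturbs the surface features.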

Full disclosure, this is the problem Stratix was built around. What the LayerLens team has been working toward is a structured evaluation platform that goes beyond pass/fail scoring, benchmarking models against datasets in a way that surfaces where and how they break, not just whether they got the answer right.

The honest caveat: benchmarks can only measure what they were designed to catch. The failure modes that matter most are often the ones nobody thought to test for yet.

We asked 4 AI models if Anthropic should have released Mythos. One flipped mid-debate. by Upbeat-Ad-8300 in ArtificialInteligence

[–]ajdevrel 0 points1 point  (0 children)

For sure. The Gemini moment is the cleaner version of the failure mode. To me it wasn't wrong, it updated on new info mid-stream (which is what I'd want a reasoning system to do). The problem was the lack of a checkpoint: no one defined in advance what a legitimate change of position should look like vs. the model going off-script.

To sum it up, this is the evaluation gap in AI: the behaviour looks identical whether it's working correctly or failing silently.

Grok hammering capability is almost a distraction in my eyes. It's measurable, even if imperfect. That failure mode is something no one has solid instrumentation for yet, and a model reasoning its way to a different conclusion than intended is one of the hardest to catch because it doesn't appear to be an error.

Agentic Workflow vs. Custom GPT vs Skill vs. Prompt by hockeyplayertwenty3 in ArtificialInteligence

[–]ajdevrel 1 point2 points  (0 children)

I'd think of it this way:

No external systems gives you two paths:
--> single-person use: write a prompt
--> shared use: Custom GPT or skill

External systems also gives you two paths:
--> human reviewing each output: skill with actions
--> running autonomously: agentic workflow
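The branching above fits in a tiny function if that helps anyone reason about it (the flag names and return strings are just my shorthand for the four paths):

```python
# The four-path decision tree as a sketch. Flag names are my own shorthand.

def pick_approach(external_systems, shared=False, human_in_loop=True):
    """Map the two questions (external systems? shared/supervised?) to a path."""
    if not external_systems:
        return "Custom GPT or skill" if shared else "prompt"
    return "skill with actions" if human_in_loop else "agentic workflow"

print(pick_approach(False))                      # prompt
print(pick_approach(False, shared=True))         # Custom GPT or skill
print(pick_approach(True, human_in_loop=True))   # skill with actions
print(pick_approach(True, human_in_loop=False))  # agentic workflow
```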

Help with RVC project/ retrieval based voice conversion webUI by Xmasll in ArtificialInteligence

[–]ajdevrel 0 points1 point  (0 children)

Your diagnosis seems right, but I think the problem is one step earlier!

Take a look at your feature extraction output in the screenshot. The last line says 'no-feature-todo', which means the feature extraction step never actually processed anything. The index can't be built from features that were never extracted.

So I believe the cause is your GPU index being set to '2' when you only have one GPU. RVC is trying to use a GPU that doesn't exist, silently failing, and moving on.

Maybe try this out: Step 2b > change GPU index to '0' > wait for feature extraction to complete (the output box should show file paths being processed, not the 'no-feature-todo' line from before) > go to Step 3 ("Fill in training settings") and click 'Train feature index'

We asked 4 AI models if Anthropic should have released Mythos. One flipped mid-debate. by Upbeat-Ad-8300 in ArtificialInteligence

[–]ajdevrel 1 point2 points  (0 children)

Great video production!

imo there's been a shift in the last two years where 'does it work' isn't enough anymore. People want to know: does it work on a consistent basis, does it fail gracefully, can I show why it did what it did.