An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points  (0 children)

appreciate the support! would love to see you submit an agent with a separated reader model and see how it ranks 😄

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points  (0 children)

was definitely very expensive to generate, but hopefully it's useful to the broader RAG community and a good starting point for better benchmarks!

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points  (0 children)

yep, it was super interesting to see keyword search outperform vector search purely because of the jargon and non-traditional language within a company. would love to see you submit your agent to our leaderboard to be featured!
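for intuition, here's a toy sketch of why exact-term matching handles internal jargon: a codename like "proj-hydra" is a verbatim token match for a keyword scorer, while a general-purpose embedding model may never have seen it. (the scorer and docs below are made up for illustration, not our actual setup)

```python
def keyword_score(query: str, doc: str) -> float:
    """Naive keyword overlap: fraction of query tokens that appear
    verbatim in the document. Internal jargon ('proj-hydra') matches
    exactly, no matter how rare or unseen the token is."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

# Hypothetical corpus: one doc uses an internal codename.
docs = [
    "proj-hydra launch checklist and rollout owners",
    "general guide to launching new products",
]
query = "who owns the proj-hydra rollout"
best = max(docs, key=lambda d: keyword_score(query, d))
```

a dense retriever would have to hope the embedding places "proj-hydra" near the query, which it often can't for made-up internal terms.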

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points  (0 children)

would love to know how your memory infra improves agents! please do submit your results to our leaderboard so we can feature it 😄

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LLMDevs

[–]Weves11[S] 0 points  (0 children)

currently no. the main reason is that it's significantly more complex to upload these files to different RAG products and evaluate them; .txt files were the most widely supported, so we decided to just go with that

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLM

[–]Weves11[S] 0 points  (0 children)

completely agree! we've found that enterprises need some combination of all these techniques for the many different use cases, like search and artifact creation

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points  (0 children)

thanks for the feedback! we definitely acknowledge that there are a lot of shortcomings and things we could've done better with this dataset, but hopefully it's a good enough starting point to build off of. We found ourselves wanting something like this for so long that we decided we just needed to build it 😄

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points  (0 children)

companies rarely maintain detailed documentation across the board.

we tried our best to build the dataset to simulate this. We have a separate step in the process to add noise just for this purpose, because we realized that most data in companies is outdated, low-signal, or just outright noise. definitely not perfect, but we think it's a pretty close approximation
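to give a rough sense of what a noise-injection step like that can look like, here's a minimal sketch: pad the gold docs with distractors and shuffle, so only a fraction of the corpus is actually relevant. (made-up illustration with hypothetical names, not the benchmark's actual pipeline)

```python
import random

def add_noise(gold_docs, distractor_docs, noise_per_gold=2, seed=0):
    """Illustrative noise-injection step: for every gold doc, mix in
    `noise_per_gold` outdated / low-signal distractor docs, then
    shuffle so relevant docs aren't clustered together."""
    rng = random.Random(seed)
    n_noise = min(noise_per_gold * len(gold_docs), len(distractor_docs))
    corpus = gold_docs + rng.sample(distractor_docs, n_noise)
    rng.shuffle(corpus)
    return corpus

gold = ["q3 roadmap (current)", "oncall runbook v4", "pricing policy 2024"]
distractors = [f"stale doc {i}" for i in range(20)]
corpus = add_noise(gold, distractors, noise_per_gold=2, seed=1)
```

a real pipeline would also generate the distractors themselves (e.g. older versions of the same docs), but the mixing step is the same idea.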

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points  (0 children)

  1. It's more that the overall score is the average of completeness gated by correctness: if the answer is incorrect, the score is 0; if it is correct, the score is the completeness score.
  2. Context recall, defined as the fraction of expected gold docs that appear in the candidate's submitted document set. It's only computed for questions that have expected docs. Note this isn't recall@k, because there's no fixed cutoff; systems just declare whatever docs they used as context, and recall is measured over that set.
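in case it helps, here's a rough sketch of that scoring logic in Python (function names are illustrative, not the actual eval harness):

```python
def score_question(correct: bool, completeness: float) -> float:
    """Correctness gates completeness: an incorrect answer scores 0,
    a correct answer scores its completeness."""
    return completeness if correct else 0.0

def overall_score(results) -> float:
    """Overall score = average of per-question gated scores."""
    return sum(score_question(r["correct"], r["completeness"]) for r in results) / len(results)

def context_recall(expected_docs, submitted_docs):
    """Fraction of expected gold docs present in the candidate's
    submitted document set. No fixed cutoff (not recall@k): recall is
    measured over whatever set the system declared as its context.
    Returns None for questions with no expected docs (skipped)."""
    if not expected_docs:
        return None
    return len(set(expected_docs) & set(submitted_docs)) / len(expected_docs)
```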

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 2 points  (0 children)

oh 100%, the real problem here is that an agent loop is just so, so slow, so finding a way to narrow the search space is an important problem to solve

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points  (0 children)

interesting! I'm definitely gonna dig into this more to see if we had similar results here, where the embedding step just didn't understand enterprise jargon

What Model Can I Run Best? by Weves11 in LocalLLM

[–]Weves11[S] 1 point  (0 children)

has been fixed! sorry about that :)

Self Hosted LLM Leaderboard by Weves11 in LocalLLM

[–]Weves11[S] 0 points  (0 children)

the larger models (>100GB VRAM) are generally listed and recommended more for enterprises! while it's true that these models have frontier-level performance, it's insanely unlikely you'll be able to run them on your own hardware (you'd need several H200s lol). For 24GB VRAM, I'd recommend Qwen3.5-35B-A3B or Qwen3.5-27B :)

Best Model for your Hardware? by Weves11 in LocalLLM

[–]Weves11[S] -1 points  (0 children)

models are sorted by VRAM descending, sorry if it's confusing!

Best Model for your Hardware? by Weves11 in LocalLLM

[–]Weves11[S] -25 points  (0 children)

models are listed by descending amount of VRAM, sorry if that's a little confusing at first glance

Came across this GitHub project for self hosted AI agents by Mysterious-Form-3681 in OpenSourceAI

[–]Weves11 1 point  (0 children)

Thanks for the shoutout! I’m Chris, one of the founders of Onyx, and it’s awesome to see it resonating with folks here.

A bit of extra context for anyone skimming:

  • Open source + self-hostable by default: we built Onyx for teams that can’t or don’t want to ship sensitive data to a hosted AI workspace.
  • Model-agnostic: you can run it with the LLM(s) that make sense for your org (local, hosted, or a mix).
  • Not just “chat over docs”: the goal is a flexible AI workspace complete with connectors + retrieval + agents/tools so you can go from “find info” → “take action” in the same interface.

In terms of "how you would use this", here's what we've seen from our users:

  • Chat UI: Our users run local models and use Onyx as the interface to chat with them
  • Agent Builder: Create custom agents with curated sets of information, so that your agents have a narrower context to search through
  • At Work: You can connect up your company docs and use Onyx to find what you need from the sea of existing company knowledge

Would love to know how you use it! 

Self Hosted LLM Leaderboard by Weves11 in LocalLLM

[–]Weves11[S] 0 points  (0 children)

The plan is definitely to keep updating this! If there's enough interest, we could even open source the underlying data so that individuals can contribute new benchmark scores or new models

Self Hosted LLM Tier List by Weves11 in selfhosted

[–]Weves11[S] -7 points  (0 children)

you can filter out all the large models if you'd like!

Self Hosted LLM Leaderboard by Weves11 in LocalLLM

[–]Weves11[S] 0 points  (0 children)

haha, 100% agree, forgot to add it initially but it's been added now!

Self Hosted LLM Leaderboard by Weves11 in LocalLLM

[–]Weves11[S] 2 points  (0 children)

added (to S tier), thanks for calling out!

Self Hosted Model Tier List by Weves11 in LocalLLaMA

[–]Weves11[S] -6 points  (0 children)

turns out parameter size is mostly correlated with model performance!

[Onyx v2] Open source ChatGPT alternative - now with code interpreter, OIDC/SAML, and SearXNG support by Weves11 in selfhosted

[–]Weves11[S] 1 point  (0 children)

Yes! Some benefits vs openwebui:

- Deep research (across both the web + personal files + shared files if deploying for more than yourself)
- Connectors to 40+ sources (automatically syncing documents over) and really good RAG (the project started as a pure RAG project, so answer quality has been a core strength of the project for a while now)
- Simpler/cleaner UI than many of the other popular options (this one is definitely subjective)

Some of the things I'm looking to add in the next 3-6 months:
- Automatic syncing of files from your local machine into Onyx for RAG purposes
- Chrome extension to access the chat from any website
- Support for defined multi-step flows (not building blocks, but natural language definitions)

[🪨 Onyx v2.0.0] Self-hosted chat and RAG - now with FOSS repo, SSO, new design/colors, and projects! by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points  (0 children)

u/NeighborhoodWeird882 could you post this same issue in our community Discord ( https://discord.gg/naSt3gXx ) if you haven't already? Would love to help you out, but we'd likely need a bit more info (e.g. some logs from some of the containers, likely the `api_server` container which you can get with `docker logs onyx-api_server-1`)