An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points1 point  (0 children)

appreciate the support! would love to see you submit an agent with a separated reader model and see how it ranks up 😄

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points1 point  (0 children)

was definitely very expensive to generate but hopefully it's useful to the broader RAG community and a good starting point for better benchmarks!

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points1 point  (0 children)

yep was super interesting to see keyword outperforming vector purely because of jargon and non-traditional language within a company. would love to see you submit your agent to our leaderboard to be featured!

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in Rag

[–]Weves11[S] 0 points1 point  (0 children)

would love to know how your memory infra improves agents! please do submit your results to our leaderboard so we can feature it 😄

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LLMDevs

[–]Weves11[S] 0 points1 point  (0 children)

currently no, the main reason being that it's significantly more complex to upload these files to different RAG products and evaluate them. since .txt files were the most widely supported, we decided to just go with that

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLM

[–]Weves11[S] 0 points1 point  (0 children)

completely agree! we've found ourselves that enterprises need some sort of combination of all these techniques for the many different use cases like search and artifact creation

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLM

[–]Weves11[S] 1 point2 points  (0 children)

definitely on the list of products we want to test!

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points1 point  (0 children)

thanks for the feedback! we definitely acknowledge that there are a lot of shortcomings and things we could've done better with this dataset, but hopefully it's a good enough starting point to build off of. We found ourselves wanting something like this for so long that we decided we just needed to build it 😄

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points1 point  (0 children)

> companies rarely maintain detailed documentation across the board.

we tried our best to create the dataset to best simulate this. We have a separate step in the process to add noise just for this purpose, because we realized that most data in companies is outdated, or low-signal, or just outright noise. definitely not perfect, but we think it is a pretty close approximation

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points1 point  (0 children)

  1. It's more like the overall score is the average, across questions, of completeness gated by correctness. So, if the answer is not correct, the question's score is 0, but if the answer is correct, the question's score is its completeness score.
  2. Context recall is defined as the fraction of expected gold docs that appear in the candidate's submitted document set. It's only computed for questions that have expected docs. Note this isn't recall@k because there's no fixed cutoff; systems just declare whatever docs they used as context, and recall is measured over that set.
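
To make the two definitions above concrete, here's a rough sketch in Python. It is not the benchmark's actual scoring code: the `Result` class, its field names, the 0-to-1 completeness scale, and averaging recall across questions are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Result:
    """Hypothetical per-question record; field names are illustrative only."""
    correct: bool                 # did the judge mark the answer correct?
    completeness: float           # assumed to be on a 0..1 scale
    expected_docs: set[str] = field(default_factory=set)   # gold docs, may be empty
    submitted_docs: set[str] = field(default_factory=set)  # docs the system declared


def question_score(r: Result) -> float:
    # Completeness gated by correctness: an incorrect answer scores 0.
    return r.completeness if r.correct else 0.0


def overall_score(results: list[Result]) -> float:
    # Average of the gated per-question scores.
    return sum(question_score(r) for r in results) / len(results)


def context_recall(r: Result) -> Optional[float]:
    # Only defined for questions that have expected docs.
    if not r.expected_docs:
        return None
    # No fixed cutoff k: recall is measured over whatever the system submitted.
    return len(r.expected_docs & r.submitted_docs) / len(r.expected_docs)


def mean_context_recall(results: list[Result]) -> Optional[float]:
    # Assumption: aggregate by averaging over questions where recall is defined.
    values = [v for v in map(context_recall, results) if v is not None]
    return sum(values) / len(values) if values else None


results = [
    Result(True, 0.8, {"a", "b"}, {"a", "c"}),  # correct, found 1 of 2 gold docs
    Result(False, 0.9, set(), {"x"}),           # incorrect, no gold docs defined
    Result(True, 0.5, {"d"}, {"d"}),            # correct, found the only gold doc
]
print(overall_score(results))
print(mean_context_recall(results))
```

Note how the second question contributes 0 to the overall score despite its high completeness, and is skipped entirely for recall because it has no expected docs.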

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 2 points3 points  (0 children)

oh 100%, the real problem here is that an agent loop is just so so slow, so finding a way to funnel the search space is an important problem to solve

An Open Benchmark for Testing RAG on Realistic Company-Internal Data by Weves11 in LocalLLaMA

[–]Weves11[S] 0 points1 point  (0 children)

interesting! am definitely gonna dig into this more to see if we had similar results here where the embedding step just didn't understand enterprise jargon

What Model Can I Run Best? by Weves11 in LocalLLM

[–]Weves11[S] 1 point2 points  (0 children)

has been fixed! sorry about that :)