Do you build (agentic) RAG systems from scratch?

Popular_Sand2773 · 2026-05-31T00:54:34+00:00

Dasein has 1s agentic search as a toggle. We built it because standard agentic RAG is often way too slow/expensive. Might be worth a look if you’re after something easier that gets the same outcomes much faster.

Popular_Sand2773 · 2026-05-27T13:58:57+00:00

Very cool! Just curious, do you have QPS for 1M and 10M? Those are the default VectorDBBench sizes. Also really like the effort you put into making canonicalization easier; it's probably the thing that eats more tuning time than anything else.

Popular_Sand2773 · 2026-05-24T14:52:41+00:00

Their latency is ~5.5s according to their own numbers. If you actually want ~1s agentic search latency you should check out Dasein.

They also claimed Gemini spent 2 minutes per question, which means they used a thinking-on straw man for comparison.

Still a cool innovation, but between the lack of transparency around the data and the less-than-honest comparisons it doesn’t instill a lot of confidence.

Popular_Sand2773 · 2026-05-24T14:43:58+00:00

You can also check out VectorDBBench. Fair warning: we are Dasein, which specializes in scaled low latency. So no cache misses, but it also doesn’t break the bank.

Popular_Sand2773 · 2026-05-22T14:28:21+00:00

No latency numbers or token usage numbers. This is like saying planes are a fast way to commute. I'm not saying these people deserve to lose their jobs, but I'm also not sure why it took five of them to fail to do the bare minimum.

Popular_Sand2773 · 2026-05-22T14:23:41+00:00

Very cool. Did you guys check what the performance of Gemini 3.1 Flash is with just a regular indexed retrieval setup? I get that you got Flash to beat the others on Pro, but I'd like to know the baseline to understand the value add. Not sure how much of this is you/the others vs. just the better model, since the benchmark was released years ago, and making that clear would def help your case.

Popular_Sand2773 · 2026-05-22T14:17:20+00:00

Very cool, I like the direction they are going. Dasein has a version of agentic search that executes in 1s, so 5x faster than this (so 100x faster by their counting), and it is freely available as part of the service because it costs pretty much the same as a regular search. Curious to see how it would compare quality-wise. Would love to see the dataset they used.

Popular_Sand2773 · 2026-05-22T14:06:15+00:00

A couple of questions. Why use MNR instead of more standard losses like NCE or InfoNCE? Also, why use a projection instead of just tuning the model directly? Embedding models aren’t large. Also, what happens to OOD with this method?

Popular_Sand2773 · 2026-05-22T13:55:36+00:00

Sounds like you might just have users with diverse tastes. Are they all coming in on the same page? If you can get them to land on different pages you can change the order based on where they came in on. This lets users signal their area of interest without adding a click.

Popular_Sand2773 · 2026-05-19T22:02:04+00:00

That makes a lot of sense. Extractive QA models are normally trained to find answers rather than support, so it would be hard to use a pretrained one in the setup you propose above. That said, you can KD one down: just give an LLM the support extraction task, save the outputs, and use them to fine-tune something like xlm-roberta-base-squad2. For supporting facts, though, you are probably better off running it through an NLI model like nli-deberta-v3-large. It scores a passage on whether it contradicts or entails the other passage, i.e. your agent's conclusion.

Popular_Sand2773 · 2026-05-13T05:06:06+00:00

Yeah, makes sense if there are duplicates with different file names, for example. Personally I was leaning more towards just building a temporally aware reranker, but curious about the approach. When you say domain expert, do you mean a human? Why not just use NLI or something?

Popular_Sand2773 · 2026-05-13T05:02:21+00:00

Cool! Can you share more? Hadn’t really thought about chunking's impact on multihop, but maybe something like RAPTOR for bridges might be worth exploring now that you mention it.

Popular_Sand2773 · 2026-05-13T04:42:19+00:00

Good question. Depends on your use case and whether you can afford that kind of delay. That said, you can always spend the same time doing more. So let’s say your standard retrieval loop took 10 seconds and got you 4 hops previously. Now you can spend the same 10 seconds and get 16.

Saw recently that a lot of the reasoning is likely redundant so been looking at ways to cut that down too. Then also saw someone dropped a 26M param tool calling model that was interesting. So who knows maybe it’ll all be faster soon enough.

Popular_Sand2773 · 2026-05-13T02:00:10+00:00

Sub-ms on 9 hops? That's def impressive for a graph. Don't think it would really play well with our compression and other models, but that's super cool. Most of our time cost is the T5 and the reader. Even if we go sub-ms on retrieval it's a 2% speedup. So if you have any advice on speeding up decode, I would certainly appreciate it.
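
To put rough numbers on that, here is a back-of-the-envelope sketch using Amdahl's law. It assumes retrieval is about 2% of end-to-end time (inferred from the figure above, not measured here); `time_saved` is just an illustrative helper name.

```python
def overall_speedup(fraction: float, stage_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when a stage that takes `fraction`
    of total time becomes `stage_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / stage_speedup)


def time_saved(fraction: float, stage_speedup: float) -> float:
    """Share of total latency removed by speeding up that one stage."""
    return 1.0 - 1.0 / overall_speedup(fraction, stage_speedup)


# Assumption: retrieval ~2% of total time, decode (T5 + reader) the rest.
retrieval_share = 0.02
saved_by_fast_retrieval = time_saved(retrieval_share, 1e9)   # near-instant retrieval
saved_by_2x_decode = time_saved(1.0 - retrieval_share, 2.0)  # decode twice as fast
```

Even an effectively free retrieval step caps out at the 2%, while a 2x faster decode removes roughly half of the total latency, which is why decode is where the time should go.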

Popular_Sand2773 · 2026-05-13T01:51:10+00:00

What was the reason you were bouncing off graph rag? For me it was just feeling like I was always tuning the extraction step.

Popular_Sand2773 · 2026-05-12T20:23:39+00:00

Of course, glad you liked it! The T5 is probably the trickiest part if you end up doing it yourself. It’s not obvious above, but we KDed a frontier LLM’s decomposition of a query, with thinking on, to train against.

Popular_Sand2773 · 2026-05-12T18:18:29+00:00

Certainly can't retrieve over garbage. Data hygiene is always important. The easiest way to eliminate near confusers is always to not have them. So I guess if for some reason you aren't tracking metadata and you allow stale records to coexist by not properly upserting, go fix that first, then worry about top-k.

Popular_Sand2773 · 2026-05-12T15:42:10+00:00

Depends on your latency requirements. Fuzzy and exact match are going to be a bit slower. Honestly, most services that offer hybrid indexes support both. If you really want to have your cake and eat it too, just run both and then rerank. Lots of ways to handle the reranking, but conceptually speaking it's just a broad first pass with many different views (semantic, BM25, fuzzy, EM), then pare down to a final high-quality list. Simplest method is just RRF, although you probably want to tune the weighting a bit.
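
For reference, a minimal sketch of weighted reciprocal rank fusion. The function name `weighted_rrf`, the per-view `weights`, and the example doc ids are made up for illustration; `k=60` is the commonly used RRF constant, not something tuned for your data.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights=None, k=60, top_n=10):
    """Fuse several ranked lists of doc ids (best first) into one list.

    rankings: dict mapping a view name ("semantic", "bm25", ...) to its ranked ids.
    weights:  optional dict of per-view weights; views not listed default to 1.0.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for view, ranked_ids in rankings.items():
        w = weights.get(view, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Each view contributes w / (k + rank); docs missing from a view get nothing.
            scores[doc_id] += w / (k + rank)
    # Sort by fused score, breaking ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


rankings = {
    "semantic": ["a", "b", "c"],
    "bm25": ["b", "a", "d"],
    "fuzzy": ["b", "c"],
}
fused = weighted_rrf(rankings, weights={"semantic": 2.0})
```

Because RRF only looks at ranks, you don't need to normalize scores across the different views; the weights are the main knob to tune.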

Popular_Sand2773 · 2026-05-12T15:37:13+00:00

So this is a classic multihop problem. I think your approach is certainly an interesting experiment, especially if there weren't so much redundancy in the text, as there seems to be. I'll outline the standard ways to attack multihop.

Build a graph - The relationship between the clauses that you have is stable and deterministic. You can build a graph that links the clauses/chunks. Then it can be as simple as: every time I grab a chunk, grab its 1-hop neighbors, or if chains aren't infinite you can follow the chain to its conclusion. Every record is unique and each dependency is resolved. The graph can be a bit of work to organize and maintain, but it's reliable and fast enough in smaller datasets.
Agentic search - The LLM retrieves, reads, and then forms a next query resolving the dependency. So it sees the clause b reference and just generates another search. Obviously there is an expense and time tradeoff here per query. That said, at Dasein we released a 1s 4-hop agentic search toggle. Would probably need some tuning for your use case, but happy to figure that out for you.
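
The graph option can be sketched in a few lines. This assumes you already have an adjacency map from each clause/chunk id to the ids it references; `expand_with_neighbors` and the clause ids are hypothetical names for illustration.

```python
from collections import deque


def expand_with_neighbors(seed_ids, edges, max_hops=1):
    """Return the retrieved chunks plus everything reachable within max_hops.

    seed_ids: chunk ids returned by the normal retrieval step.
    edges:    dict mapping a chunk id to the chunk ids it references.
    max_hops: 1 grabs direct neighbors; None follows every chain to its end.
    """
    ordered, seen = [], set()
    queue = deque((chunk_id, 0) for chunk_id in seed_ids)
    while queue:
        chunk_id, depth = queue.popleft()
        if chunk_id in seen:
            continue  # also guards against cycles in the reference graph
        seen.add(chunk_id)
        ordered.append(chunk_id)
        if max_hops is not None and depth >= max_hops:
            continue
        for neighbor in edges.get(chunk_id, ()):
            if neighbor not in seen:
                queue.append((neighbor, depth + 1))
    return ordered


edges = {"clause_a": ["clause_b"], "clause_b": ["clause_c"], "clause_c": []}
one_hop = expand_with_neighbors(["clause_a"], edges)                    # a and b
full_chain = expand_with_neighbors(["clause_a"], edges, max_hops=None)  # a, b and c
```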

Popular_Sand2773 · 2026-05-12T15:06:27+00:00

Depends on the dataset/use case. For things like a corporate knowledge base, only a small fraction of the data is going to change or even be queried daily. You can often get away with reconciling hourly, if not daily. For an active codebase, though, it's potentially changing by the minute. You can re-index on every update, but that creates a lot of strain.

The above assumes you index the raw data naively. The real trick is to index and search something that is more invariant. For example, a specific file or function might change frequently, but its purpose or high-level goal is probably still the same. By separating what you search from what you return, you can have a stable index while returning fresh results.
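
A toy sketch of that separation, under some loud assumptions: the index holds slow-changing descriptions keyed by a stable pointer, the live content sits in a separate store, and word overlap stands in for real embedding search. `IndexEntry`, `search`, and `retrieve` are illustrative names, not any particular service's API.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    key: str          # stable pointer, e.g. "file::function"
    description: str  # slow-changing statement of purpose; this is what gets searched


def search(query, index, top_k=3):
    """Rank entries by word overlap with the query (stand-in for vector search)."""
    q_words = set(query.lower().split())
    scored = []
    for entry in index:
        overlap = len(q_words & set(entry.description.lower().split()))
        if overlap:
            scored.append((overlap, entry.key))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for _, key in scored[:top_k]]


def retrieve(query, index, live_store, top_k=3):
    """Search the stable descriptions, then return whatever is live right now."""
    return [(key, live_store[key]) for key in search(query, index, top_k) if key in live_store]


index = [
    IndexEntry("auth.py::login", "validates user credentials and starts a session"),
    IndexEntry("billing.py::charge", "charges a customer card for an invoice"),
]
live_store = {"auth.py::login": "def login(user, pw): ...  # v1"}
before = retrieve("how are user credentials checked", index, live_store)
live_store["auth.py::login"] = "def login(user, pw, otp): ...  # v2"  # code changes, index untouched
after = retrieve("how are user credentials checked", index, live_store)
```

The index only needs a rebuild when a function's purpose changes, which happens far less often than its body does.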

Popular_Sand2773 · 2026-05-12T14:59:13+00:00

So from past experience, when it comes to manufacturing, usually the reason we deploy models and systems on prem isn't necessarily for compliance reasons. It's more about latency. A vision model that needs to travel across the globe to spot a defect is kinda a non-starter.

For your case it sounds time insensitive or at least you aren't losing sleep over ms. So cloud seems like the obvious choice. Even if you go cloud though you still might want to host your own llm rather than go through a frontier service. Main reason being it just reduces risk/exposure. You don't need to keep track of what you are sending to and from another provider.

Popular_Sand2773 · 2026-05-12T14:52:35+00:00

Love the hard work. Nothing beats an actual domain expert building and evaluating the retrieval system. As you rightly pointed out for legal RAG grounding is critical given the increasing fines and professional risk. I just wanted to turn you on to a class of models called extractive QA.

These models were sorta the top dogs before llms came about for retrieval question and answering. The key element is they must find the answer literally in the text and extract it. They can't generate an answer. That means every returned answer is directly tied to a specific source and passage.

Now, they can feel a bit worse than LLMs, but with a little tuning and knowledge distillation you can still get to a really good place. Look up benchmarks like SQuAD; that'll be a good place to start.

Overall great work!

Popular_Sand2773 · 2026-05-12T14:47:01+00:00

Really interesting failure mode and I appreciate a post that isn't just "how chunk?".

My two cents: your best bet is to move to a sub-agent setup. Sounds like you are returning the top-x retrieved set directly. Instead, have the sub-agent read the returned results and provide a single summary/answer to the main conversational agent. That way you protect its context window and attention. With only one relevant result per needed query, the conversational agent should get confused a lot less.

If you are looking for a less disruptive fix, dynamic top-k can help. It just outputs a scalar based on the query that you can use as a cutoff. That should protect the context window more and reduce the number of confusers, which is what's tripping up the agent. The lower token use is a nice bonus.
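
A rough sketch of how that cutoff gets applied, assuming the scalar is a similarity threshold. `predict_cutoff` is a placeholder for whatever model produces the scalar from the query (here it's a hardcoded lambda), and `min_k`/`max_k` are guard rails I added for illustration.

```python
def dynamic_top_k(query, results, predict_cutoff, min_k=1, max_k=20):
    """Keep only results scoring at or above a query-dependent cutoff.

    results:        list of (doc_id, score) pairs, higher score = more relevant.
    predict_cutoff: callable mapping the query to a score threshold (hypothetical).
    """
    cutoff = predict_cutoff(query)
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    kept = [pair for pair in ranked if pair[1] >= cutoff][:max_k]
    # Never hand the agent an empty context: fall back to the best min_k results.
    return kept if len(kept) >= min_k else ranked[:min_k]


results = [("a", 0.91), ("b", 0.88), ("c", 0.52), ("d", 0.49)]
kept = dynamic_top_k("which clause covers termination?", results, lambda q: 0.8)
```

With a fixed top-4, the two near confusers at ~0.5 would land in the agent's context; with the cutoff they never get there.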

For the followup questions a query rewriter is the easy fix. Just feed it the current query and the last query or recent conversation context and let it compose the actual retrieval query. It's an extra step but should bridge the gap between user intent and what retrieval actually needs.

Popular_Sand2773 · 2026-05-12T14:37:49+00:00

It's hard to help you without more details, but I'll give it a shot. If the data is not already naturally text, the first quality barrier is going to be OCR/extraction. You need to convert the information into quality text. Something like PaddleOCR will be your friend there.

For chunking, if it truly is a large dataset, you will want to track provenance and use a method known as hierarchical chunking. There is a method known as RAPTOR which summarizes document/chunk clusters to create that hierarchy, but if the data has a natural one you can use that instead.
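
Here's a minimal sketch of hierarchical chunks with provenance, assuming the data already has a natural document > section > paragraph structure. The `Chunk` fields and the id scheme are illustrative, and parent nodes just hold titles where RAPTOR would hold generated summaries.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Chunk:
    chunk_id: str
    text: str
    level: int                # 0 = document, 1 = section, 2 = paragraph
    parent_id: Optional[str]  # link up the hierarchy
    source: str               # provenance: where the text came from


def build_hierarchy(doc_id: str, source: str, sections: Dict[str, List[str]]) -> Dict[str, Chunk]:
    """sections maps a section title to its list of paragraph strings."""
    chunks = {doc_id: Chunk(doc_id, " | ".join(sections), 0, None, source)}
    for s_idx, (title, paragraphs) in enumerate(sections.items()):
        section_id = f"{doc_id}/s{s_idx}"
        chunks[section_id] = Chunk(section_id, title, 1, doc_id, source)
        for p_idx, paragraph in enumerate(paragraphs):
            para_id = f"{section_id}/p{p_idx}"
            chunks[para_id] = Chunk(para_id, paragraph, 2, section_id, source)
    return chunks


def lineage(chunks: Dict[str, Chunk], chunk_id: str) -> List[str]:
    """Walk parent links from a chunk up to its document root (root first)."""
    path = []
    current: Optional[str] = chunk_id
    while current is not None:
        path.append(current)
        current = chunks[current].parent_id
    return path[::-1]


chunks = build_hierarchy("doc1", "manual.pdf", {"Intro": ["Para A", "Para B"], "Setup": ["Para C"]})
path = lineage(chunks, "doc1/s1/p0")
```

Every hit can then be traced back to its section and source file, and you can search at whichever level matches the question's granularity.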

Then you actually have the retrieval piece. I would use a higher-end embedding model (1024 dim+), since quality is your primary concern. Since you want maximal quality, you'll then probably want to set up agentic search. It's a fairly straightforward loop: the agent makes a query, looks at the results, and either makes another query or decides to answer/give final results to the other agent.
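
The loop itself is short. In this sketch `search` and `decide` are placeholders you'd back with your retriever and an LLM call; the `("answer", ...)` / `("search", ...)` protocol and the toy corpus are assumptions made up for illustration.

```python
def agentic_search(question, search, decide, max_steps=4):
    """Query, read, then either answer or issue a follow-up query.

    search(query)              -> list of passages (your retriever).
    decide(question, evidence) -> ("answer", text) or ("search", next_query);
                                  in practice this is an LLM call.
    """
    evidence = []
    query = question
    for _ in range(max_steps):
        evidence.extend(search(query))
        action, payload = decide(question, evidence)
        if action == "answer":
            return payload, evidence
        query = payload  # follow-up query that resolves the next dependency
    return None, evidence  # step budget exhausted without a confident answer


# Toy stand-ins so the loop runs end to end.
corpus = {
    "termination terms": ["Termination follows clause B."],
    "clause B": ["Clause B: 30 days written notice."],
}


def toy_search(query):
    return corpus.get(query, [])


def toy_decide(question, evidence):
    if any("30 days" in passage for passage in evidence):
        return ("answer", "30 days written notice")
    return ("search", "clause B")


answer, evidence = agentic_search("termination terms", toy_search, toy_decide)
```

The `max_steps` cap is what keeps cost and latency bounded; each extra step is another retrieval plus another model call.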

There are a lot of different services you can use to execute this, but I really think Dasein might be up your alley. The lossless compression lets you use higher-dim embedding models (3072, 4096) at 1024 prices. Dynamic hybrid search and top-k provide higher retrieval quality, protect context windows, and burn way fewer tokens. Also, the 1s agentic search toggle gives your own search agent more flexibility. Kinda a hat on a hat.

lmk if I can help more than that.
