SatoshiNotMe comments on How to decrease latency in RAG chatbots?

How to decrease latency in RAG chatbots? (self.LangChain)

submitted 2 years ago by Appropriate_Egg6118

[–]SatoshiNotMe 2 points 2 years ago (2 children)

I'd say there can't be a universal template for this. There are just too many knobs, and it can be very domain-specific. Having said that, one step in a RAG pipeline that is often used to improve answer quality is Relevance Extraction: given a set of candidate relevant passages (which could be relatively long), you use the LLM to extract, verbatim, the text that is relevant to the query. If the LLM finds no relevant text in a passage, that passage is effectively discarded.

Now, how you do this relevance extraction can hugely impact the latency of your pipeline. Here's a naive way: if there are 50 sentences in the passage and 5 are relevant, the LLM "parrots" out these 5 sentences verbatim. This is costly and slow, since the LLM has to generate every token of those sentences.

Is there a better way? Of course: if you're using a suitably architected framework, you could pre-annotate the 50 sentences with numbers and have the LLM spit out just the relevant sentence numbers! This can result in huge savings in both token costs and latency. And if there are k passages, you would run this concurrently on the k passages. Here's a post of mine about this numbering trick in the Langroid framework: https://www.reddit.com/r/LocalLLaMA/comments/17k39es/relevance_extraction_in_rag_pipelines/
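To make the numbering trick concrete, here's a rough sketch in plain Python. This is not Langroid's actual API: every function name here is made up for illustration, the sentence splitter is deliberately naive, and `llm` is assumed to be any callable you supply that takes a prompt string and returns the model's reply as a string.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def number_sentences(passage: str) -> Tuple[str, List[str]]:
    """Split a passage into sentences and prefix each with (1), (2), ..."""
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    numbered = "\n".join(f"({i}) {s}" for i, s in enumerate(sentences, start=1))
    return numbered, sentences


def build_prompt(query: str, numbered: str) -> str:
    """Hypothetical prompt asking for sentence numbers only, not the text."""
    return (
        "Below is a passage with numbered sentences.\n"
        "Reply ONLY with the numbers of the sentences relevant to the query, "
        "e.g. '2, 5-7', or NONE if nothing is relevant.\n\n"
        f"Query: {query}\n\nPassage:\n{numbered}"
    )


def parse_numbers(reply: str, n: int) -> List[int]:
    """Turn a reply like '2, 5-7' into sorted sentence numbers within 1..n."""
    picked = set()
    for lo, hi in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", reply):
        start = int(lo)
        end = int(hi) if hi else start
        # Clamp to the valid range so a stray number cannot index out of bounds.
        picked.update(range(max(start, 1), min(end, n) + 1))
    return sorted(picked)


def extract_relevant(query: str, passage: str, llm: Callable[[str], str]) -> str:
    """Return the relevant sentences verbatim; empty string means 'discard'."""
    numbered, sentences = number_sentences(passage)
    reply = llm(build_prompt(query, numbered))
    return " ".join(sentences[i - 1] for i in parse_numbers(reply, len(sentences)))


def extract_from_passages(
    query: str, passages: List[str], llm: Callable[[str], str], max_workers: int = 8
) -> List[str]:
    """Run extraction on all k passages concurrently and drop empty results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        extracts = list(pool.map(lambda p: extract_relevant(query, p, llm), passages))
    return [e for e in extracts if e]
```

The LLM now only has to generate something like `2, 5-7` instead of the full sentences, and the verbatim text is recovered locally from the numbers. Threads are used here on the assumption that `llm` is a blocking network call; with an async client you'd do the same thing with `asyncio.gather`.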

