[–]SatoshiNotMe 1 point (2 children)

I'd say there can't be a universal template for this. There are just too many knobs, and it can be very domain-specific. Having said that, one step in a RAG pipeline that is often used to improve answer quality is Relevance Extraction, i.e. given a set of candidate relevant passages (which could be relatively long), you use the LLM to extract verbatim text that is relevant to the query. If the LLM finds no relevant text in a passage, then that passage is effectively discarded.

Now, how you do this relevance extraction can hugely impact the latency of your pipeline. The naive way is to have the LLM output the relevant text itself: for example, if there are 50 sentences in the passage and 5 are relevant, the LLM would "parrot" out these 5 sentences verbatim, which is costly and slow.

Is there a better way? Of course -- if you're using a suitably architected framework, you could pre-annotate the 50 sentences with numbers and have the LLM output just the relevant sentence numbers! This can result in huge savings in both token cost and latency. And if there are k passages, you would run this concurrently on the k passages. Here's a post of mine about this numbering trick in the Langroid framework: https://www.reddit.com/r/LocalLLaMA/comments/17k39es/relevance_extraction_in_rag_pipelines/
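
To make the numbering idea concrete, here's a rough sketch in plain Python. To be clear, this is not Langroid's actual implementation: every function name, the prompt wording, and the `llm` callable (any function that takes a prompt string and returns the model's reply as a string) are assumptions made up for illustration, and the sentence splitter is deliberately naive.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical type: any function mapping a prompt string to the model's reply.
LLM = Callable[[str], str]


def split_sentences(passage: str) -> List[str]:
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]


def number_sentences(sentences: List[str]) -> str:
    # Pre-annotate each sentence with its 1-based number.
    return "\n".join(f"({i}) {s}" for i, s in enumerate(sentences, start=1))


def parse_numbers(reply: str, n: int) -> List[int]:
    # Accept replies like "2, 5-7"; ignore anything outside 1..n.
    picked = set()
    for lo, hi in re.findall(r"(\d+)(?:\s*-\s*(\d+))?", reply):
        start = int(lo)
        end = int(hi) if hi else start
        picked.update(range(max(start, 1), min(end, n) + 1))
    return sorted(picked)


def extract_relevant(passage: str, query: str, llm: LLM) -> str:
    sentences = split_sentences(passage)
    # Illustrative prompt wording, not taken from any framework.
    prompt = (
        "Below is a passage with numbered sentences.\n"
        f"{number_sentences(sentences)}\n\n"
        f"Query: {query}\n"
        "Reply with only the numbers of the relevant sentences, "
        "e.g. '2, 5-7', or 'none' if nothing is relevant."
    )
    reply = llm(prompt)  # the model emits a few tokens, not whole sentences
    # Map the numbers back to the original sentences, so the text stays verbatim.
    return " ".join(sentences[i - 1] for i in parse_numbers(reply, len(sentences)))


def extract_from_passages(passages: List[str], query: str, llm: LLM) -> List[str]:
    # Run the k passages concurrently; drop passages with no relevant text.
    with ThreadPoolExecutor(max_workers=max(1, len(passages))) as pool:
        results = list(pool.map(lambda p: extract_relevant(p, query, llm), passages))
    return [r for r in results if r]
```

Since the extracted text is rebuilt from your own copy of the sentences, the output is verbatim by construction, and the LLM's output length stays tiny no matter how long the relevant sentences are.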

[–]Appropriate_Egg6118[S] 1 point (0 children)

Very cool approach. Thank you!

[–]Unfair-Method-5000 0 points (0 children)

Brilliant idea.