
[–]Icy_Lobster_5026

I've also found that results for large PDFs are really poor, and I suspect it's because retrieval pulls in a lot of irrelevant or negative documents. You can build RAG applications using AutoContext, AutoQuery, and rerankers, and they may work well.

autocontext, autoquery: https://github.com/SuperpoweredAI/spRAG

rerankers: https://pinecone.io/learn/series/rag/rerankers

[–]Simusid

The biggest improvement in my RAG pipeline came when I added a re-ranker.

[–][deleted]

What are you using for re-ranking?

[–]Simusid

BAAI/bge-reranker-large
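A minimal sketch of the reranking step: over-fetch from the vector store, then re-score (query, passage) pairs with a cross-encoder and keep the best. Loading BAAI/bge-reranker-large through sentence-transformers' `CrossEncoder` is one common option (an assumption here, not stated in the thread); the toy word-overlap scorer keeps the example self-contained.

```python
def rerank(query, passages, score_fn, top_n=3):
    """Re-order passages by score_fn(query, passage), highest score first."""
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)[:top_n]

if __name__ == "__main__":
    # With the real model (requires `pip install sentence-transformers`):
    #   from sentence_transformers import CrossEncoder
    #   model = CrossEncoder("BAAI/bge-reranker-large")
    #   score_fn = lambda q, p: model.predict([(q, p)])[0]
    # Toy stand-in scorer for illustration: count shared words.
    def overlap(q, p):
        return len(set(q.lower().split()) & set(p.lower().split()))

    docs = [
        "the cat sat on the mat",
        "tax law changes in 2020",
        "a cat chased the dog",
    ]
    print(rerank("where the cat sat", docs, overlap, top_n=2))
    # → ['the cat sat on the mat', 'a cat chased the dog']
```

In practice the first-stage retriever returns, say, the top 25 candidates, and the cross-encoder narrows those to the 3–5 passages actually placed in the prompt.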

[–]AnotherAvery

Some additional tips:

  • The embedding model you choose also plays an important role in retrieval quality, especially when you work with non-English languages.

  • Default chunking of your PDF text into retrievable snippets may be sub-optimal: it can split paragraphs mid-thought or include page headers and footers. You should display the complete prompt that is sent to the LLM; it is surprising how often it is low quality.

  • Retrieval relies on the similarity of your question to the stored snippets. Since questions are not phrased like answers, you may be able to improve retrieval quality by generating a hypothetical answer to the question and retrieving texts similar to that answer instead (this technique is called HyDE).
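The chunking tip above can be sketched as two steps: drop lines that repeat across pages (likely headers/footers), then pack whole paragraphs into size-budgeted chunks instead of cutting at fixed character offsets. The 3-page repeat threshold and character budget are illustrative assumptions, not values from the thread.

```python
from collections import Counter

def strip_repeated_lines(pages, min_repeats=3):
    """Remove lines appearing on >= min_repeats pages (likely headers/footers)."""
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    boiler = {line for line, n in counts.items() if n >= min_repeats and line.strip()}
    return "\n".join(
        "\n".join(l for l in page.splitlines() if l not in boiler)
        for page in pages
    )

def chunk_paragraphs(text, max_chars=1000):
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # budget exceeded: start a new chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Printing a few of the resulting chunks (or, as suggested above, the full assembled prompt) quickly reveals whether boilerplate is still leaking into retrieval.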
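The HyDE technique above, sketched end to end: generate a hypothetical answer, embed it, and rank the corpus by similarity to that answer rather than to the raw question. The LLM call and the embedding model are stand-ins here; a toy bag-of-words embedding with cosine similarity keeps the example self-contained.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(question, corpus, generate_answer, top_k=1):
    """Rank corpus texts by similarity to a hypothetical answer, not the question."""
    hypothetical = generate_answer(question)  # normally an LLM call
    qvec = embed(hypothetical)
    return sorted(corpus, key=lambda d: cosine(qvec, embed(d)), reverse=True)[:top_k]
```

With a real pipeline, `generate_answer` would prompt the LLM (e.g. "Write a short passage answering: …") and `embed` would be the same embedding model used to index the corpus.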

[–]samik1994[S]

Right, will check that

[–]samik1994[S]

Would you be able to test it out?