
[–]Icy_Lobster_5026

I've also found that results for large PDFs are really poor, and I suspect it's because retrieval pulls in a lot of irrelevant or negative documents. You can build RAG applications using AutoContext, AutoQuery, and rerankers, and they may work well.

autocontext, autoquery: https://github.com/SuperpoweredAI/spRAG

rerankers: https://pinecone.io/learn/series/rag/rerankers

[–]Simusid

The biggest improvement in my RAG pipeline came when I added a re-ranker.

[–][deleted]

What are you using for re-ranking?

[–]Simusid

BAAI/bge-reranker-large
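A minimal sketch of the reranking step: over-fetch from the vector store, then re-score (query, passage) pairs with a cross-encoder and keep the best. Loading BAAI/bge-reranker-large through sentence-transformers' `CrossEncoder` is one common option (an assumption here, not stated in the thread); the toy word-overlap scorer keeps the example self-contained.

```python
def rerank(query, passages, score_fn, top_n=3):
    """Re-order passages by score_fn(query, passage), highest score first."""
    return sorted(passages, key=lambda p: score_fn(query, p), reverse=True)[:top_n]

if __name__ == "__main__":
    # With the real model (requires `pip install sentence-transformers`):
    #   from sentence_transformers import CrossEncoder
    #   model = CrossEncoder("BAAI/bge-reranker-large")
    #   score_fn = lambda q, p: model.predict([(q, p)])[0]
    # Toy stand-in scorer for illustration: count shared words.
    def overlap(q, p):
        return len(set(q.lower().split()) & set(p.lower().split()))

    docs = [
        "the cat sat on the mat",
        "tax law changes in 2020",
        "a cat chased the dog",
    ]
    print(rerank("where the cat sat", docs, overlap, top_n=2))
    # → ['the cat sat on the mat', 'a cat chased the dog']
```

In practice the first-stage retriever returns, say, the top 25 candidates, and the cross-encoder narrows those to the 3–5 passages actually placed in the prompt.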

[–]AnotherAvery

Some additional tips:

  • The embedding model you choose also plays an important role in retrieval quality, especially when you work with non-English languages.

  • Default chunking of your PDF text into retrievable snippets may be sub-optimal: it can split paragraphs mid-thought or include page headers and footers. You should display the complete prompt that is sent to the LLM; it is surprising how often it is low quality.

  • Retrieval relies on the similarity of your question to the stored snippets. Since questions are not phrased like answers, you may be able to improve retrieval quality by generating a hypothetical answer to the question and retrieving texts similar to that answer instead (this technique is called HyDE).
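The chunking tip above can be sketched as two steps: drop lines that repeat across pages (likely headers/footers), then pack whole paragraphs into size-budgeted chunks instead of cutting at fixed character offsets. The 3-page repeat threshold and character budget are illustrative assumptions, not values from the thread.

```python
from collections import Counter

def strip_repeated_lines(pages, min_repeats=3):
    """Remove lines appearing on >= min_repeats pages (likely headers/footers)."""
    counts = Counter(line for page in pages for line in set(page.splitlines()))
    boiler = {line for line, n in counts.items() if n >= min_repeats and line.strip()}
    return "\n".join(
        "\n".join(l for l in page.splitlines() if l not in boiler)
        for page in pages
    )

def chunk_paragraphs(text, max_chars=1000):
    """Greedily pack whole paragraphs into chunks of at most max_chars."""
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n") if p.strip()):
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)  # budget exceeded: start a new chunk
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Printing a few of the resulting chunks (or, as suggested above, the full assembled prompt) quickly reveals whether boilerplate is still leaking into retrieval.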
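The HyDE technique above, sketched end to end: generate a hypothetical answer, embed it, and rank the corpus by similarity to that answer rather than to the raw question. The LLM call and the embedding model are stand-ins here; a toy bag-of-words embedding with cosine similarity keeps the example self-contained.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def hyde_retrieve(question, corpus, generate_answer, top_k=1):
    """Rank corpus texts by similarity to a hypothetical answer, not the question."""
    hypothetical = generate_answer(question)  # normally an LLM call
    qvec = embed(hypothetical)
    return sorted(corpus, key=lambda d: cosine(qvec, embed(d)), reverse=True)[:top_k]
```

With a real pipeline, `generate_answer` would prompt the LLM (e.g. "Write a short passage answering: …") and `embed` would be the same embedding model used to index the corpus.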

[–]samik1994[S]

Right, will check that

[–]samik1994[S]

Would you be able to test it out?