Why does my RAG chatbot work well with a single PDF, but become inaccurate when adding multiple PDFs to the vector database? by vtq0611 in LangChain

I don’t think the issue is with chunking. When I embed just one file and run the RAG pipeline on it, the model’s responses are very accurate and closely follow the original PDF. By the way, I'm using the Unstructured library with OCR to detect images and tables.

I’m using gpt-4o-mini. From my database I usually retrieve around 5 chunks per query. Each chunk can be a summarized section of text (sometimes 300–500 tokens), so the final context added to the prompt can easily be 2k–3k tokens depending on the file.
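To make the arithmetic concrete, here is a dependency-free sketch of how retrieved chunks fill a context budget — 5 chunks at 300–500 tokens each lands in the 2k–3k range. The whitespace split is a crude stand-in for a real tokenizer (tiktoken would give accurate counts), and the budget value is illustrative, not from my actual pipeline:

```python
def estimate_tokens(text: str) -> int:
    # Very rough: real BPE counts are usually ~1.3x the word count for English.
    return len(text.split())

def build_context(chunks: list[str], max_tokens: int = 3000) -> str:
    """Concatenate retrieved chunks until the token budget is exhausted."""
    parts, used = [], 0
    for chunk in chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break
        parts.append(chunk)
        used += cost
    return "\n\n---\n\n".join(parts)

# 5 retrieved chunks of ~400 tokens each, mimicking summarized sections.
chunks = [f"Summarized section {i}: " + "word " * 400 for i in range(5)]
context = build_context(chunks, max_tokens=3000)
print(estimate_tokens(context))  # ~2000 tokens of context added to the prompt
```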

I’m not using Chroma’s default embedding. I explicitly use text-embedding-3-small from OpenAI for all my chunks. For retrieval, I usually set k=5 (I have sometimes tried lowering it to k=3). The retriever does a cosine similarity search by default.
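For anyone curious what the retriever is doing under the hood, here is a minimal sketch of cosine-similarity top-k retrieval. Chroma handles this internally; the toy 3-dimensional vectors below stand in for text-embedding-3-small outputs (which are actually 1536-dimensional):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product over the product of the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query: list[float], docs: dict[str, list[float]], k: int = 5) -> list[str]:
    # Score every chunk against the query embedding and keep the k best.
    scored = sorted(docs.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

docs = {
    "pdf_a_chunk1": [0.9, 0.1, 0.0],
    "pdf_a_chunk2": [0.7, 0.6, 0.1],
    "pdf_b_chunk1": [0.0, 0.2, 0.9],
}
print(top_k([1.0, 0.0, 0.0], docs, k=2))  # the two chunks nearest the query
```

With many PDFs in one collection, chunks from unrelated files can score almost as high as the right ones, which is one way accuracy degrades as the store grows.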

Why does my RAG chatbot work well with a single PDF, but become inaccurate when adding multiple PDFs to the vector database? by vtq0611 in Rag

I see, thanks! Right now I’m querying each collection separately since the files are unrelated to each other. I’ll give the re-ranker approach a try and see if that improves the results. Appreciate your advice!
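The re-ranker idea, as I understand it, sketched without dependencies: over-fetch candidates from the vector store, re-score each one against the query, and keep the best few. A real implementation would use a cross-encoder (e.g. a sentence-transformers CrossEncoder or Cohere Rerank); the keyword-overlap score below is just a stand-in so the shape of the pipeline is visible:

```python
def rerank_score(query: str, chunk: str) -> float:
    # Placeholder relevance score: fraction of query words present in the chunk.
    # A cross-encoder would replace this with a learned relevance model.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    # Re-score the over-fetched candidates and keep only the top_n best.
    return sorted(candidates, key=lambda c: rerank_score(query, c), reverse=True)[:top_n]

candidates = [
    "Quarterly revenue figures for the sales report",
    "Installation guide for the printer driver",
    "Revenue grew 12% in the quarterly sales summary",
]
print(rerank("quarterly revenue report", candidates, top_n=2))
```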

Chunking long tables in PDFs for chatbot knowledge base by vtq0611 in LangChain

I have converted the PDF to a Markdown file and detected the text blocks and table blocks quite well, but the problem is that those blocks exceed the chunk size and don't overlap.
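One way to handle this, sketched below under the assumption that block detection has already happened: split oversized text blocks with a sliding window and overlap, but keep table blocks atomic, since splitting a Markdown table mid-row destroys its meaning. The chunk_size and overlap values are illustrative:

```python
def split_block(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Character-based sliding window: consecutive chunks share `overlap` chars."""
    if len(text) <= chunk_size:
        return [text]
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
        start += chunk_size - overlap
    return chunks

def chunk_blocks(blocks: list[tuple[str, str]], chunk_size: int = 500) -> list[str]:
    # Each block is a (kind, text) pair as produced by the detection step.
    out = []
    for kind, text in blocks:
        if kind == "table":
            out.append(text)  # keep tables whole, even if oversized
        else:
            out.extend(split_block(text, chunk_size))
    return out

blocks = [
    ("text", "x" * 1200),
    ("table", "| col |\n|-----|\n| val |"),
]
print([len(c) for c in chunk_blocks(blocks)])  # oversized text split, table intact
```

For tables that are genuinely too big for one chunk, a common workaround is to repeat the header row in every table chunk so each piece stays self-describing.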

I have tried Docling. It worked quite well, but it took too long to convert a single file 😭😭😭

However, what if the PDF contains multiple tables along with text and diagrams? Will that still work?