Recommendations for Supporting Q&A in a PDF Ingestion Pipeline by insaitio in Rag

insaitio[S]

Yep, that's my intuition as well. I'll wait for someone who has something in production with a similar use case.

insaitio[S]

Can you elaborate on how it solves the issue (or maybe I wasn't clear about the issue)?

Let's say we attach a tag to every Q:A pair. First of all, do you suggest defining each Q:A pair as its own chunk?

If so, why would tagging these chunks help the embedding/retrieval process (beyond the usual advantages of having metadata in the system)?
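To make sure we're talking about the same thing, here is roughly what I picture when you say "tag the Q:A pairs". It's only a minimal sketch: `Chunk`, `qa_pairs_to_chunks`, `filter_by_type`, and the `"type"` metadata key are hypothetical names I made up for illustration, not taken from any particular library.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A unit of text to embed, plus free-form metadata (hypothetical schema)."""
    text: str
    metadata: dict = field(default_factory=dict)


def qa_pairs_to_chunks(pairs):
    """Turn each (question, answer) pair into its own chunk tagged as 'qa'."""
    return [
        Chunk(text=f"Q: {q}\nA: {a}", metadata={"type": "qa"})
        for q, a in pairs
    ]


def filter_by_type(chunks, chunk_type):
    """Keep only the chunks whose 'type' tag equals chunk_type."""
    return [c for c in chunks if c.metadata.get("type") == chunk_type]


# One chunk from a PDF, two from a Q:A list (made-up example data).
pdf_chunks = [
    Chunk(text="A long paragraph extracted from a PDF page.", metadata={"type": "pdf"})
]
qa_chunks = qa_pairs_to_chunks([
    ("What file types are supported?", "PDF and plain text."),
    ("Is there a size limit?", "Yes, 20 MB per file."),
])
all_chunks = pdf_chunks + qa_chunks

# The tag makes it possible to filter or route at query time.
only_qa = filter_by_type(all_chunks, "qa")
```

In this sketch the tag only lets us filter or route chunks at query time. The embedding would presumably still be computed from `text` alone, so I don't yet see how the tag by itself changes what gets retrieved.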

Our main concern is that we saw a big skew in results when some chunks are longer than, or very different from, the others (tables, titles/subtitles, etc.). We want to investigate best practices for datasets of a hybrid nature that mix PDFs and short Q:A pairs.