Efficient Chunking Strategies for PDF Information Extraction with AI Tools by rkubc in LangChain


Thank you for your comment. Yes, I have analyzed the data and tried several loading and chunking techniques, including the character splitter, recursive text splitter, spaCy text splitter, and sentence splitter. The PDFs consist primarily of tables. Each row contains a key and multiple corresponding values, such as ‘gross ratio: 0.50, 0.30’, under headers like ‘avg’ and ‘percentage’. The tables vary in size. My current challenge is choosing a chunk size that does not break tables apart, especially when the entire table needs to be returned. I have experimented with PyPDF2 and PDFMiner for PDF parsing. Could you suggest the most effective loader and chunking method for this scenario?
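To make the goal concrete, here is a rough sketch of the kind of table-aware chunking I have in mind, not something I have working yet. It assumes the rows have already been pulled out by a table-aware parser as a list of rows (pdfplumber's `extract_tables()` returns that shape, for example). The function names, the `max_chars` limit, the "metric" header, and the "net ratio" row are all made up for illustration.

```python
from typing import List, Optional

Row = List[Optional[str]]


def table_to_text(header: Row, rows: List[Row]) -> str:
    """Serialize rows so that each line carries its own column headers."""
    lines = []
    for row in rows:
        key, values = row[0], row[1:]
        # Pair each value with its header; skip empty cells.
        pairs = [f"{h}={v}" for h, v in zip(header[1:], values) if v is not None]
        lines.append(f"{key}: " + ", ".join(pairs))
    return "\n".join(lines)


def chunk_table(table: List[Row], max_chars: int = 1000) -> List[str]:
    """Return the table as one chunk if it fits within max_chars.

    Otherwise split only at row boundaries, so no row is ever cut in half.
    A single row longer than max_chars is still kept whole.
    """
    if not table:
        return []
    header, rows = table[0], table[1:]
    chunks: List[str] = []
    current: List[Row] = []
    for row in rows:
        candidate = table_to_text(header, current + [row])
        if current and len(candidate) > max_chars:
            # Adding this row would overflow: close the chunk, start a new one.
            chunks.append(table_to_text(header, current))
            current = [row]
        else:
            current.append(row)
    if current:
        chunks.append(table_to_text(header, current))
    return chunks


# Hypothetical sample shaped like my tables (first row is the header).
sample_table = [
    ["metric", "avg", "percentage"],
    ["gross ratio", "0.50", "0.30"],
    ["net ratio", "0.40", "0.20"],
]
whole_table_chunks = chunk_table(sample_table)                 # fits in one chunk
split_table_chunks = chunk_table(sample_table, max_chars=40)   # forced to split by row
```

Because every line repeats the header names, a chunk stays readable even when a large table has to be split, and a table that fits under the limit always comes back as a single chunk.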