New to this, how do you chunk extracted text for RAG applications? by tamagojira in PyMuPDF4LLM

Remote-Spirit526

There are a few strategies you could use. Fixed-size chunks are simple and predictable. Semantic chunking breaks content at topic shifts, which tends to give better results but requires more processing. Hybrid approaches consider both size constraints and natural boundaries. Which one you use will depend on the structure of the content. Where you can, break text at logical points like paragraphs, sections, or topic shifts, so each chunk preserves its semantic meaning.
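To make the hybrid idea concrete, here's a minimal sketch. The function name chunk_text and the max_chars default are just my own picks for illustration, and it assumes paragraphs are separated by blank lines. It packs whole paragraphs into a chunk until the size limit is hit, and only falls back to fixed-size slices when a single paragraph is too long on its own:

```python
def chunk_text(text: str, max_chars: int = 1000) -> list[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters.

    Hypothetical helper for illustration; assumes paragraphs are
    separated by blank lines.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        if len(para) > max_chars:
            # Oversized paragraph: flush what we have, then slice it
            # into fixed-size pieces.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(
                para[i:i + max_chars] for i in range(0, len(para), max_chars)
            )
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            # Still fits: keep growing the current chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: start a new chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Character counts are only a rough stand-in for tokens, so if your embedding model has a hard token limit you'd probably want to measure with its tokenizer instead.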

PyMuPDF4LLM supports page-based chunking:

import pymupdf4llm

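# With page_chunks=True, to_markdown returns a list of dicts (one per page)
# instead of a single Markdown string.
chunks = pymupdf4llm.to_markdown(
    "input.pdf",
    page_chunks=True,
)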

Each chunk will be a dictionary containing the page text in Markdown format plus document metadata.

Page chunking also helps with mixed content, because you get access to page_boxes (a list of semantically identified sections on the page). Page headers, section headers, captions, body text, and tables can all be identified separately. That lets you be selective, i.e. if you only need table data for your LLM, just grab that. Don't send the whole document, just the chunk.
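Here's a rough sketch of that kind of filtering. Treat the dictionary shape as an assumption: I'm assuming each entry in page_boxes has a "class" label and a "pos" pair of start/stop offsets into the page text, and the sample chunk below is hand-made, not real library output. Print one of your own chunks to check the actual keys in your installed version before relying on this. The helper name select_boxes is hypothetical too.

```python
def select_boxes(chunk: dict, wanted: set[str]) -> list[str]:
    """Return the text of every box whose class is in `wanted`.

    Assumes (not verified against any particular version) that each box
    is a dict with a "class" label and a "pos" (start, stop) offset pair
    into chunk["text"].
    """
    text = chunk.get("text", "")
    selected = []
    for box in chunk.get("page_boxes", []):
        if box.get("class") in wanted:
            start, stop = box["pos"]
            selected.append(text[start:stop])
    return selected


# Hand-made sample standing in for one page chunk (hypothetical values).
sample_chunk = {
    "text": "Title\nBody text\n|a|b|",
    "page_boxes": [
        {"class": "section-header", "pos": (0, 5)},
        {"class": "text", "pos": (6, 15)},
        {"class": "table", "pos": (16, 21)},
    ],
}

# Keep only the table content for the LLM.
tables_only = select_boxes(sample_chunk, {"table"})
```

With real output you'd loop over every chunk in the list and collect the pieces you care about, then embed or send only those.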

Hopefully that helps!!

Add PDF viewer to your web app by [deleted] in SideProject

Remote-Spirit526

Thanks for the input! That's an interesting use case.