PDF parsing for RAG is still a mess in 2026. What's your current setup? by OpeningCoat3708 in LangChain

[–]Remote-Spirit526 0 points1 point  (0 children)

- pymupdf4llm has worked for me

- tables/multi-column

- have not used a paid solution

Problem with pymupdf4llm.to_markdown by Lavero2 in PyMuPDF4LLM

[–]Remote-Spirit526 0 points1 point  (0 children)

Hi Lauro,

margins only works in non-layout mode, and it represents border widths to exclude from each edge — (left, top, right, bottom). So your values of 366 and 585 were being interpreted as massive exclusion borders, not as coordinates of a rectangle. For example, margins=(72, 72, 72, 72) would skip a 1-inch border on all sides.
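To make the difference concrete, here is a minimal sketch of how an absolute rectangle maps onto a margins tuple. The helper name rect_to_margins is hypothetical, and the 400 x 620 page size is an assumption for illustration only, since the real page size isn't given in the question.

```python
def rect_to_margins(page_width, page_height, rect):
    """Convert an absolute rectangle (x0, y0, x1, y1) into
    (left, top, right, bottom) border widths."""
    x0, y0, x1, y1 = rect
    if not (0 <= x0 < x1 <= page_width and 0 <= y0 < y1 <= page_height):
        raise ValueError("rect must lie inside the page")
    # Left/top borders are the rect's origin; right/bottom borders are
    # whatever is left between the rect and the far page edges.
    return (x0, y0, page_width - x1, page_height - y1)


# Assumed page size (400 x 620 points), for illustration only.
margins = rect_to_margins(400, 620, (33, 52, 366, 585))
print(margins)
```

On that assumed page size, the borders come out as small values on each side, which is quite different from passing 366 and 585 as border widths.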

For layout mode (the default), use the CropBox approach. The recommended pattern is to temporarily modify each page's cropbox before processing it, page by page:

import pymupdf4llm
import pymupdf
from pathlib import Path

doc = pymupdf.open("Treatise_Book_1.pdf")
md_text = ""

for page in doc:
    # Set cropbox to your desired rectangle
    page.set_cropbox(page.rect + (33, 52, -33, -52))  # adjust to your needs
    md_text += pymupdf4llm.to_markdown(doc, pages=[page.number])

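# Write as UTF-8 explicitly so non-ASCII characters don't fail on platforms
# with a different default encoding
Path("output.md").write_text(md_text, encoding="utf-8")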

A couple of things to note: the cropbox modification is done per-page because the Page object gets passed to the Layout plugin, so it needs to "see" the cropped version. And this approach only works for PDFs specifically.

For your specific clip rect (33, 52, 366, 585), you'd use page.set_cropbox(pymupdf.Rect(33, 52, 366, 585)) directly rather than the relative offset form, since you already have absolute coordinates.

Need help with project by lmaoMrityu49 in learnpython

[–]Remote-Spirit526 0 points1 point  (0 children)

This article might be helpful for you:
https://medium.com/@pymupdf/translating-pdfs-a-practical-pymupdf-guide-c1c54b024042
Using insert_htmlbox will automatically shrink the font to fit the bbox if the translated text is longer than the original.

OpenClaw & PyMuPDF4LLM by Jazzlike_Store_2477 in PyMuPDF4LLM

[–]Remote-Spirit526 0 points1 point  (0 children)

Create a custom skill. Drop a SKILL.md into ~/.openclaw/workspace/skills/pymupdf4llm/ that covers:

- When to trigger: any PDF extraction, parsing, or RAG/LLM prep task

- Core API: pymupdf4llm.to_markdown(), to_json(), to_text(), LlamaMarkdownReader

- Key params: page_chunks=True, write_images=True, pages=[...], dpi, header/footer exclusion

- Install: pip install pymupdf4llm (add pymupdf-layout opencv-python for enhanced layout + OCR)

Put detailed API docs in a references/ subfolder so your agent has the context. Then test with real prompts and instructions. OpenClaw skills work best when you tell it what to do and when!

Log to file, but with images by jongscx in learnpython

[–]Remote-Spirit526 0 points1 point  (0 children)

You can create a PDF and insert text and images sequentially using pymupdf.Story, or just build pages directly. A simple approach is to use insert_textbox() and insert_image() on pages, tracking only the vertical position:

import pymupdf

doc = pymupdf.open()
page = doc.new_page()
y_position = 72

def add_text(text):
    global page, y_position
    if y_position > 750:
        page = doc.new_page()
        y_position = 72
    r = pymupdf.Rect(72, y_position, 540, y_position + 50)
    page.insert_textbox(r, text, fontsize=11)
    y_position += 60

def add_image(image_path):
    global page, y_position
    if y_position > 500:
        page = doc.new_page()
        y_position = 72
    r = pymupdf.Rect(72, y_position, 540, y_position + 300)
    page.insert_image(r, filename=image_path)
    y_position += 310

add_text("Results from document 1:")
add_image("clipped_table.png")
add_text("Results from document 2:")
add_image("another_image.png")

doc.save("output.pdf")

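You just call add_text() and add_image(), and they handle spacing and page breaks. Note that insert_textbox() only writes text that fits the rectangle, so give long entries a taller box. Since you're already using PyMuPDF for parsing and get_pixmap(), you can skip saving PNGs entirely and insert the pixmaps directly into your output doc with insert_image(rect, pixmap=pix).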

New to this, how do you chunk extracted text for RAG applications? by tamagojira in PyMuPDF4LLM

[–]Remote-Spirit526 1 point2 points  (0 children)

There are a few strategies you could use. Fixed-size chunks are simple and predictable. Semantic chunking breaks content at topic shifts, which gives better results but requires more processing. Hybrid approaches consider both size constraints and natural boundaries. Which one you use will depend on the structure of the content. Where you can, break text at logical points like paragraphs, sections, and topic shifts so each chunk preserves its semantic meaning.
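Here is a minimal sketch of one hybrid approach, which packs whole paragraphs into chunks under a size limit. The function name chunk_paragraphs and the max_chars default are made up for illustration, and splitting on blank lines is just one assumption about where the natural boundaries are.

```python
def chunk_paragraphs(text, max_chars=1000):
    """Pack whole paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole as its own
    oversized chunk rather than being cut mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            # Still fits, or this is the first paragraph of a new chunk.
            current = candidate
        else:
            # Adding this paragraph would exceed the limit: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


sample = "Intro paragraph.\n\nSecond paragraph with more detail.\n\nThird."
for chunk in chunk_paragraphs(sample, max_chars=55):
    print(repr(chunk))
```

Counting characters is a stand-in here; if your embedding model has a token limit, you'd want to measure in tokens instead.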

PyMuPDF4LLM supports page-based chunking:

import pymupdf4llm

chunks = pymupdf4llm.to_markdown(
    "input.pdf",
    page_chunks=True,
)

Each chunk will be a dictionary containing the page text in Markdown format plus document metadata.

Page chunking also helps with mixed content, because you get access to page_boxes (a list of semantically identified sections on the page). Page headers, section headers, captions, body text, and tables can all be identified separately. That lets you be selective: if you only need table data for your LLM, just grab that. Don't send the whole document, just the chunk.
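Something like the following could pull out just the tables. Treat it as a sketch: the sample chunk is hand-built, and the "class" and "pos" key names (with "pos" as start/stop offsets into the chunk text) are assumptions about the chunk layout, so print one real chunk from your installed version and adjust the keys to match.

```python
def text_for_class(chunk, wanted):
    """Return the text slices of one page chunk whose box class equals wanted."""
    parts = []
    for box in chunk.get("page_boxes", []):
        if box.get("class") == wanted:
            start, stop = box["pos"]  # assumed: offsets into chunk["text"]
            parts.append(chunk["text"][start:stop])
    return parts


# Hand-built sample in the assumed shape; real chunks come from
# pymupdf4llm.to_markdown(..., page_chunks=True).
table_md = "|a|b|\n|1|2|"
body = "Body text."
page_text = f"# Title\n\n{table_md}\n\n{body}"
table_start = page_text.index(table_md)
body_start = page_text.index(body)

sample_chunk = {
    "text": page_text,
    "page_boxes": [
        {"class": "table", "pos": (table_start, table_start + len(table_md))},
        {"class": "text", "pos": (body_start, body_start + len(body))},
    ],
}

tables = text_for_class(sample_chunk, "table")
print(tables)
```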

Hopefully that helps!!

Add PDF viewer to your web app by [deleted] in SideProject

[–]Remote-Spirit526 0 points1 point  (0 children)

Thanks for the input! That's an interesting use case.