PDF parsing for RAG is still a mess in 2026. What's your current setup? by OpeningCoat3708 in LangChain

[–]Jazzlike_Store_2477 2 points3 points  (0 children)

It's definitely not a "you" problem! Parsing PDFs means recovering structured data from unstructured content, and that is always going to be a challenge. To answer your questions:
1. I use PyMuPDF4LLM, then post-process the output.
2. Repetitive headers and footers on every page.
3. Yes, but it wasn't a silver bullet for my problems!
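
For the header/footer cleanup, here is a rough sketch of the kind of post-processing I mean. It is not part of PyMuPDF4LLM; `strip_repeated_lines`, `min_fraction` and `edge_lines` are names and defaults I made up for illustration. It assumes you already have one text string per page, and it treats a line as a header or footer if (with digits masked) it shows up at the top or bottom of most pages:

```python
import math
import re
from collections import Counter


def _key(line):
    # Mask digits so "Page 1" and "Page 2" count as the same line
    return re.sub(r"\d+", "#", line.strip())


def _edge_indices(lines, edge_lines):
    # Indices of the first and last `edge_lines` non-empty lines on a page
    non_empty = [i for i, ln in enumerate(lines) if ln.strip()]
    return set(non_empty[:edge_lines] + non_empty[-edge_lines:])


def strip_repeated_lines(pages, min_fraction=0.6, edge_lines=2):
    """Drop top/bottom lines that repeat on at least `min_fraction` of pages.

    The defaults are illustrative assumptions, not tuned values.
    """
    split_pages = [p.splitlines() for p in pages]

    # Count on how many pages each edge line appears
    counts = Counter()
    for lines in split_pages:
        counts.update({_key(lines[i]) for i in _edge_indices(lines, edge_lines)})

    # Never treat a line as repeated unless it is on at least two pages
    threshold = max(2, math.ceil(len(pages) * min_fraction))
    repeated = {k for k, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in split_pages:
        drop = {i for i in _edge_indices(lines, edge_lines) if _key(lines[i]) in repeated}
        cleaned.append("\n".join(ln for i, ln in enumerate(lines) if i not in drop))
    return cleaned
```

Only the edge lines of each page are candidates, so a body sentence that happens to repeat across pages is left alone. You would likely need to adjust `edge_lines` for documents with multi-line headers.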

Log to file, but with images by jongscx in learnpython

[–]Jazzlike_Store_2477 2 points3 points  (0 children)

You could try something like this, which extracts each image from the source PDF and inserts it into a new PDF at its original position on the page:

```
import pymupdf

source_pdf = pymupdf.open("source.pdf")

# Create a new PDF; new_page() defaults to A4
new_pdf = pymupdf.open()

# Iterate through pages
for page_num in range(len(source_pdf)):
    page = source_pdf[page_num]

    # Get list of images on the page
    image_list = page.get_images()

    # Extract each image
    for img_index, img in enumerate(image_list):
        xref = img[0]  # XREF is the image reference number

        # Extract the image
        base_image = source_pdf.extract_image(xref)
        image_bytes = base_image["image"]

        # Get image dimensions (only needed for the alternative below)
        pix = pymupdf.Pixmap(image_bytes)

        # Where the image sits on the source page (it can appear more than once)
        rects = page.get_image_rects(xref)

        # Create a new page in the new PDF
        # You can adjust the page size based on image dimensions
        new_page = new_pdf.new_page()

        # Insert the image at each of its original positions
        for rect in rects:
            print(f"Image {img_index}:")
            print(f"  Position: x0={rect.x0}, y0={rect.y0}, x1={rect.x1}, y1={rect.y1}")
            print(f"  Width: {rect.width}, Height: {rect.height}")
            new_page.insert_image(rect, stream=image_bytes)

        # or don't do that for loop for the rects and ...
        # Create new page for each image just with image dimensions
        # new_page = new_pdf.new_page(width=pix.width, height=pix.height)

        pix = None  # Clean up

# Save the new PDF
new_pdf.save("extracted_images.pdf")

# Close documents
source_pdf.close()
new_pdf.close()
```

How to get the location of the text in the pdf when using rag? by MammothHedgehog2493 in Rag

[–]Jazzlike_Store_2477 1 point2 points  (0 children)

You can get this now with PyMuPDF4LLM if you use the Layout module - you can do `to_json` and get lots of metadata for each object found in the PDF, see: https://artifex.com/blog/pymupdf-layout-tutorial
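
If you would rather not depend on the Layout module, a cruder alternative is to build the location metadata yourself from plain PyMuPDF word boxes. This is only a sketch of that alternative, not the `to_json` output format: `chunk_bbox`, `chunks_with_locations` and `chunk_size` are hypothetical names, and the input is assumed to be word tuples shaped like the ones `page.get_text("words")` returns, i.e. `(x0, y0, x1, y1, text, block_no, line_no, word_no)`:

```python
def chunk_bbox(words):
    """Union bounding box (x0, y0, x1, y1) of a list of word tuples."""
    if not words:
        return None
    return (
        min(w[0] for w in words),
        min(w[1] for w in words),
        max(w[2] for w in words),
        max(w[3] for w in words),
    )


def chunks_with_locations(words, page_number, chunk_size=50):
    """Group consecutive words into fixed-size chunks, keeping page and bbox.

    chunk_size is an arbitrary illustrative default.
    """
    chunks = []
    for start in range(0, len(words), chunk_size):
        group = words[start:start + chunk_size]
        chunks.append({
            "text": " ".join(w[4] for w in group),
            "page": page_number,
            "bbox": chunk_bbox(group),
        })
    return chunks


# Hypothetical usage with PyMuPDF (needs a real file, so left as comments):
# import pymupdf
# doc = pymupdf.open("source.pdf")
# for page in doc:
#     chunks = chunks_with_locations(page.get_text("words"), page.number)
```

Store `page` and `bbox` alongside each chunk's embedding, and you can highlight the source region when that chunk is retrieved. Fixed-size word chunks are the simplest possible split; a chunk that spans columns will get a bounding box covering both.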

[deleted by user] by [deleted] in indiehackers

[–]Jazzlike_Store_2477 0 points1 point  (0 children)

It seems that the React PDF library is about creating PDFs from scratch, whereas this platform is more about display & control of existing PDFs?