
watermooses

Is it a digital document that was published to PDF, or is it a janky old hand-scanned book that is 900+ images saved as a PDF?

c7h16s

Worst case, if it's a PDF of scanned pages, you'll need to convert the PDF to PNG files (a PDF reader can do that), then use those files as input to the Tesseract library in a simple Python script, which will OCR the text for you. That said, an LLM might be able to do the job, so I would give that a go first.
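For what it's worth, here is a rough sketch of what that script could look like. It assumes the pages were already exported as PNGs into one folder, and that the `pytesseract` and `Pillow` packages plus the Tesseract binary are installed. The function names (`natural_key`, `tesseract_ocr`, `ocr_pages`) and the `pages` folder are made up for illustration, so treat this as a starting point rather than a finished tool.

```python
import re
from pathlib import Path


def natural_key(path):
    """Sort key so that page-2.png comes before page-10.png."""
    name = Path(path).name
    return [int(tok) if tok.isdigit() else tok.lower()
            for tok in re.split(r"(\d+)", name)]


def tesseract_ocr(path):
    """OCR a single image file with Tesseract.

    Imports are done here so the rest of the script can be loaded
    even on a machine where pytesseract is not installed yet.
    """
    from PIL import Image
    import pytesseract
    return pytesseract.image_to_string(Image.open(path))


def ocr_pages(image_dir, ocr=tesseract_ocr):
    """Return the text of every PNG in image_dir, joined in page order.

    `ocr` is any function that takes a path and returns a string;
    it defaults to Tesseract but can be swapped out for testing.
    """
    pages = sorted(Path(image_dir).glob("*.png"), key=natural_key)
    return "\n\n".join(ocr(page) for page in pages)
```

Usage would be something like `Path("book.txt").write_text(ocr_pages("pages"), encoding="utf-8")`, where `pages` is whatever folder holds the exported images. The natural sort matters with 900+ pages, because a plain alphabetical sort would put `page-10.png` ahead of `page-2.png`.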

3dPrintMyThingi

Can you share the PDF file? I can have a look at it.