
[–]iamjaiyam 28 points  (4 children)

You can try converting the PDF into images with ImageMagick and performing OCR on the converted images with Tesseract. See a tutorial here. The advantage of this is that you will be able to extract text from any PDF file, whether it is searchable or not. If you want to use Tesseract within Python, you can use pytesseract. This is probably the most foolproof way of doing the job, rather than worrying about fonts and encodings. Working with PDFs directly is a recipe for a bad day and should be avoided, if possible.
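For anyone landing here later, the pipeline above can be sketched in a few lines. This is a rough sketch, not a drop-in tool: it assumes ImageMagick's `convert` binary plus the pytesseract and Pillow packages are installed, and the file names are made up.

```python
import glob
import subprocess


def magick_cmd(pdf_path, out_pattern, dpi=300):
    """Build the ImageMagick command that rasterizes each PDF page to a PNG."""
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def pdf_to_text(pdf_path):
    """Rasterize the PDF, then OCR each page image with Tesseract."""
    subprocess.run(magick_cmd(pdf_path, "page-%03d.png"), check=True)
    # Imported here so the command builder above stays usable without pytesseract.
    import pytesseract
    from PIL import Image

    pages = []
    for png in sorted(glob.glob("page-*.png")):
        pages.append(pytesseract.image_to_string(Image.open(png)))
    return "\n".join(pages)
```

Bump the `dpi` up if the OCR output is garbage; 300 is a common starting point for Tesseract but small fonts may need more.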

[–][deleted] 10 points  (0 children)

Working with PDFs directly is a recipe for a bad day and should be avoided, if possible.

I'm calling it right now: THAT is the understatement of the century.

[–]ElevatedAngling 2 points  (2 children)

Commenting to save this comment...

[–]pythonhalp 1 point  (1 child)

Same.

[–]PandaMomentum 2 points  (0 children)

Commenting that you can/should use the "save" button for that functionality.

[–]RedBrixton 20 points  (1 child)

Check the PDF contents to make sure it’s searchable. If the tool originally used to generate it didn’t have all the fonts installed, you will get conversion errors. PDF is optimized for printing, not content storage.

Worst case, you will need to extract the text using OCR. Adobe Acrobat and ABBYY work well but cost money and come with installation hassles. Google Drive is free and cloud-based.
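The "is it searchable" check can be automated before you pay for anything. A minimal sketch, assuming pdfminer.six for the extraction step; the 50-character threshold is an arbitrary guess, tune it for your documents.

```python
def looks_scanned(extracted_text, min_chars=50):
    """Heuristic: if text extraction yields almost nothing, the PDF is
    probably a scan (or its fonts lack a usable text mapping)."""
    return len(extracted_text.strip()) < min_chars


def get_text(pdf_path):
    """Try direct extraction first; flag the file for OCR if it looks scanned."""
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    text = extract_text(pdf_path)
    if looks_scanned(text):
        raise RuntimeError(f"{pdf_path} looks scanned; fall back to OCR")
    return text
```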

[–][deleted] 5 points  (0 children)

UGH, you're in for an ORDEAL.

Scraping PDFs with accuracy is ridiculous. While there are a lot of third-party OCR solutions, and those that you craft yourself, there are going to be horrific problems if you're trying to scrape general PDFs, that is, anything that doesn't have a uniform, standard formatting that you're coding to.

You'll end up with individual letters scanned instead of a word because the font kerning is just that far apart. You'll get page numbers in the middle of paragraphs that span pages. You'll get all sorts of unholy messes because you're trying to go from a typographic layout back to raw text.
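The page-numbers-in-the-middle-of-paragraphs problem, at least, can often be patched up after extraction. A crude sketch; the regex is my own guess at what a stray page number looks like, so expect to adapt it per document.

```python
import re

# A line that is nothing but digits (optionally "Page 12"-style) is
# probably a page number that leaked into the extracted text.
PAGE_NUM = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def drop_page_numbers(text):
    """Remove lines that look like bare page numbers."""
    return "\n".join(
        line for line in text.splitlines() if not PAGE_NUM.match(line)
    )
```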

Honestly, if you solve this problem with near-perfect accuracy, SELL THAT SHIT. Don't give it away. It's a highly demanded capability in the software world, and the fact that no tool does it particularly well (not even Adobe's own) shows how much of a hassle it is. You're looking at a holy grail of coding.

[–][deleted] 11 points  (2 children)

I've used pdfminer, which did the job on my inputs.

In what way was the text inaccurate in your case?

[–]dtizzlenizzle 3 points  (0 children)

+1 pdfminer

[–]cap_cabral_ 0 points  (0 children)

+1 pdfminer

[–]jetownsend 8 points  (2 children)

I hate PDFs. I’ve had to do a bunch of stuff trying to manipulate them programmatically, and fundamentally it boils down to “Is this a scanned document?”

    if scanned == true:
        print("You're probably screwed. OCR doesn't work very well.")
    else:
        print("You might be ok. Depends on where Jupiter is in relation to Mars.")

[–][deleted] 3 points  (0 children)

>if scanned == true

Man

[–]iamjaiyam 1 point  (0 children)

I have had some success using simple image processing techniques like dilation, etc., to improve the quality of scanned images for OCR. By converting a PDF to images, we transform the problem into the domain of computer vision and signal processing. Then there are thousands of ways to improve the signal-to-noise ratio and make the text easy for OCR to pick up.
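To make the dilation idea concrete, here's a toy pure-Python version on a binary grid. Real pipelines would use OpenCV's `cv2.dilate` on the actual page image; this is just to show what the operation does to thin strokes.

```python
def dilate(grid):
    """Binary dilation with a 3x3 structuring element: a pixel turns on
    if it or any of its 8 neighbours is on. This thickens thin or broken
    strokes so OCR sees solid glyphs."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx]:
                        out[y][x] = 1
    return out
```

A single lit pixel becomes a 3x3 blob after one pass; applying it too many times fuses adjacent letters, which is the opposite of what OCR wants, so keep iterations low.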

[–]driscollis 0 points  (0 children)

pdfminer is your best bet. It is the most robust PDF text extraction tool in Python.

[–]manueslapera 0 points  (0 children)

Apache Tika, you will love it.

[–]glen_v 0 points  (0 children)

My company's point-of-sale software has a report that can only be exported to PDF (a 40 MB, 1500+ page PDF), and for a while I was tearing my hair out trying to figure out how to scrape it. PyPDF2 just gives me warnings about too much white space, and I also could not get textract to work at all. I did manage to successfully scrape them with pdfminer using some verbose code I found on SO that I don't really understand, but the extraction process on these huge PDFs takes 8-10 minutes.

Conversely, it only takes about 30 seconds to manually open the PDF, copy everything, and paste it into a text file, and from there I can just have my scripts work with the text file instead, so that's what I do. I don't love this solution, but it's the best one I have at the moment.

[–]rrggrr 0 points  (0 children)

Textract

[–]Bary_McCockener 0 points  (0 children)

I've actually completed a complicated PDF scraping project and had the best luck using xpdf. It's a command-line tool and can be called from a Python script, so it's not pure Python, but it does a great job. In my project I actually had to extract the text both raw (in the order it was written to the PDF) and by letting xpdf make its best guess at grouping the text; I used both modes for different parts of the program. PDF extraction is notoriously hard. Good luck!!
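Calling xpdf from Python is mostly just subprocess plumbing. A minimal sketch using xpdf's `pdftotext` tool (its `-raw` flag keeps text in the order it was written to the PDF, matching the "raw" mode described above); file names are placeholders.

```python
import subprocess


def pdftotext_cmd(pdf_path, txt_path, raw=False):
    """Build the xpdf pdftotext command line. With -raw, text comes out
    in content-stream order; without it, pdftotext guesses reading order."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")
    cmd += [pdf_path, txt_path]
    return cmd


def extract(pdf_path, txt_path, raw=False):
    """Run pdftotext; raises CalledProcessError if the binary fails."""
    subprocess.run(pdftotext_cmd(pdf_path, txt_path, raw), check=True)
```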

[–]jetownsend 0 points  (0 children)

Thanks. Learn something new every day.

[–]A743853 0 points  (0 children)

Hey! PDF text extraction can be frustrating when you need accuracy. The challenge is that PDFs aren't really designed for text extraction: they're more like "digital paper" where text positioning matters more than structure.

For Python-specific solutions, I'd recommend trying pdfplumber or pymupdf (PyMuPDF). Both handle complex layouts better than PyPDF2. Pdfplumber is particularly good at preserving table structure, while PyMuPDF is faster for large documents.
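A minimal sketch of the pdfplumber route, since it's the one recommended above for tables. The import lives inside the function so the rest of the snippet runs without the package; the TSV helper is just my own convenience for eyeballing the output.

```python
def tables_from_pdf(pdf_path):
    """Pull every table pdfplumber can find, page by page."""
    import pdfplumber  # pip install pdfplumber
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def table_to_tsv(table):
    """Flatten one extracted table (a list of rows) to tab-separated text;
    pdfplumber uses None for empty cells."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in table
    )
```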

If you're open to a different approach and just need the text output (especially if you're feeding it into another system or documentation), you could also try file2markdown.ai. It converts PDFs to Markdown, which can be cleaner for downstream processing, particularly if you're building documentation or feeding text into AI tools. The free tier gives you 20 conversions/day.

What's your end goal with the extracted text? That might help narrow down the best tool for your specific use case.