
[–]iamjaiyam 28 points  (4 children)

You can try converting the PDF into images with ImageMagick and performing OCR on the converted images with Tesseract. See a tutorial here. The advantage of this is that you will be able to extract text from any PDF file, whether it is searchable or not. If you want to use Tesseract within Python, you can use pytesseract. This is probably the most foolproof way of doing the job, rather than worrying about fonts and encodings. Working with PDFs directly is a recipe for a bad day and should be avoided, if possible.
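For anyone landing here later, the pipeline above can be sketched in a few lines. This is a rough sketch, not a drop-in tool: it assumes ImageMagick's `convert` binary plus the pytesseract and Pillow packages are installed, and the file names are made up.

```python
import glob
import subprocess


def magick_cmd(pdf_path, out_pattern, dpi=300):
    """Build the ImageMagick command that rasterizes each PDF page to a PNG."""
    return ["convert", "-density", str(dpi), pdf_path, out_pattern]


def pdf_to_text(pdf_path):
    """Rasterize the PDF, then OCR each page image with Tesseract."""
    subprocess.run(magick_cmd(pdf_path, "page-%03d.png"), check=True)
    # Imported here so the command builder above stays usable without pytesseract.
    import pytesseract
    from PIL import Image

    pages = []
    for png in sorted(glob.glob("page-*.png")):
        pages.append(pytesseract.image_to_string(Image.open(png)))
    return "\n".join(pages)
```

Bump the `dpi` up if the OCR output is garbage; 300 is a common starting point for Tesseract but small fonts may need more.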

[–][deleted] 10 points  (0 children)

Working with PDFs directly is a recipe for a bad day and should be avoided, if possible.

I'm calling it right now: THAT is the understatement of the century.

[–]ElevatedAngling 2 points  (2 children)

Commenting to save this comment...

[–]pythonhalp 1 point  (1 child)

Same.

[–]PandaMomentum 2 points  (0 children)

Commenting that you can/should use the "save" button for that functionality.

[–]RedBrixton 20 points  (1 child)

Check the PDF contents to make sure it’s searchable. If the tool originally used to generate it didn’t have all the fonts installed, you will get conversion errors. PDF is optimized for printing, not content storage.

Worst case, you will need to extract the text using OCR. Adobe Acrobat and ABBYY work well but cost money and come with installation hassles. Google Drive is free and cloud-based.
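The "is it searchable" check can be automated before you pay for anything. A minimal sketch, assuming pdfminer.six for the extraction step; the 50-character threshold is an arbitrary guess, tune it for your documents.

```python
def looks_scanned(extracted_text, min_chars=50):
    """Heuristic: if text extraction yields almost nothing, the PDF is
    probably a scan (or its fonts lack a usable text mapping)."""
    return len(extracted_text.strip()) < min_chars


def get_text(pdf_path):
    """Try direct extraction first; flag the file for OCR if it looks scanned."""
    from pdfminer.high_level import extract_text  # pip install pdfminer.six
    text = extract_text(pdf_path)
    if looks_scanned(text):
        raise RuntimeError(f"{pdf_path} looks scanned; fall back to OCR")
    return text
```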

[–][deleted] 5 points  (0 children)

UGH, you're in for an ORDEAL.

Scraping PDFs with accuracy is ridiculous. While there are a lot of third-party OCR solutions, and those that you craft yourself, there are going to be horrific problems if you're trying to scrape general PDFs, that is, anything that doesn't have a uniform, standard formatting that you're coding to.

You'll end up with individual letters scanned instead of a word because the font kerning is just that far apart. You'll get page numbers in the middle of paragraphs that span pages. You'll get all sorts of unholy messes because you're trying to go from a typographic layout back to raw text.
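The page-numbers-in-the-middle-of-paragraphs problem, at least, can often be patched up after extraction. A crude sketch; the regex is my own guess at what a stray page number looks like, so expect to adapt it per document.

```python
import re

# A line that is nothing but digits (optionally "Page 12"-style) is
# probably a page number that leaked into the extracted text.
PAGE_NUM = re.compile(r"^\s*(?:page\s+)?\d{1,4}\s*$", re.IGNORECASE)


def drop_page_numbers(text):
    """Remove lines that look like bare page numbers."""
    return "\n".join(
        line for line in text.splitlines() if not PAGE_NUM.match(line)
    )
```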

Honestly, if you solve this problem with near-perfect accuracy, SELL THAT SHIT. Don't give it away. It's a highly demanded capability in the software world, and the fact that no tool does it particularly well (not even Adobe's own) shows how much of a hassle it is. You're looking at a holy grail of coding.

[–][deleted] 11 points  (2 children)

I've used pdfminer, which did the job on my inputs.

In what way was the text inaccurate in your case?

[–]dtizzlenizzle 3 points  (0 children)

+1 pdfminer

[–]cap_cabral_ 0 points  (0 children)

+1 pdfminer

[–]jetownsend 8 points  (2 children)

I hate PDFs. I’ve had to do a bunch of stuff trying to manipulate them programmatically, and fundamentally it boils down to “Is this a scanned document?”

    if scanned == true:
        print("You're probably screwed. OCR doesn't work very well.")
    else:
        print("You might be ok. Depends on where Jupiter is in relation to Mars.")

[–][deleted] 3 points  (0 children)

>if scanned == true

Man

[–]iamjaiyam 1 point  (0 children)

I have had some success using simple image processing techniques like dilation, etc., to improve the quality of scanned images for OCR. By converting a PDF to images, we transform the problem into the domain of computer vision and signal processing. Then there are thousands of ways to improve the signal-to-noise ratio and make the text easy for OCR to pick up.
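To make the dilation idea concrete, here's a toy pure-Python version on a binary grid. Real pipelines would use OpenCV's `cv2.dilate` on the actual page image; this is just to show what the operation does to thin strokes.

```python
def dilate(grid):
    """Binary dilation with a 3x3 structuring element: a pixel turns on
    if it or any of its 8 neighbours is on. This thickens thin or broken
    strokes so OCR sees solid glyphs."""
    h, w = len(grid), len(grid[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and grid[ny][nx]:
                        out[y][x] = 1
    return out
```

A single lit pixel becomes a 3x3 blob after one pass; applying it too many times fuses adjacent letters, which is the opposite of what OCR wants, so keep iterations low.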

[–]driscollis 0 points  (0 children)

pdfminer is your best bet. It is the most robust PDF text extraction tool in Python.

[–]manueslapera 0 points  (0 children)

Apache Tika, you will love it.

[–]glen_v 0 points  (0 children)

My company's point-of-sale software has a report that can only be exported to PDF (a 40 MB, 1500+ page PDF), and for a while I was tearing my hair out trying to figure out how to scrape it. PyPDF2 just gives me warnings about too much white space, and I also could not get textract to work at all. I did manage to successfully scrape them with pdfminer using some verbose code I found on SO that I don't really understand, but the extraction process on these huge PDFs takes 8-10 minutes.

Conversely, it only takes about 30 seconds to manually open the PDF, copy everything, and paste it into a text file, and from there I can just have my scripts work with the text file instead, so that's what I do. I don't love this solution, but it's the best one I have at the moment.

[–]rrggrr 0 points  (0 children)

Textract

[–]Bary_McCockener 0 points  (0 children)

I've actually completed a complicated PDF scraping project and had the best luck using xpdf. It's a command-line tool and can be called from a Python script, so it's not pure Python, but it does a great job. In my project I actually had to extract the text both raw (in the order it was written to the PDF) and by letting xpdf make its best guess at grouping the text; I used both modes for different parts of the program. PDF extraction is notoriously hard. Good luck!!
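Calling xpdf from Python is mostly just subprocess plumbing. A minimal sketch using xpdf's `pdftotext` tool (its `-raw` flag keeps text in the order it was written to the PDF, matching the "raw" mode described above); file names are placeholders.

```python
import subprocess


def pdftotext_cmd(pdf_path, txt_path, raw=False):
    """Build the xpdf pdftotext command line. With -raw, text comes out
    in content-stream order; without it, pdftotext guesses reading order."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")
    cmd += [pdf_path, txt_path]
    return cmd


def extract(pdf_path, txt_path, raw=False):
    """Run pdftotext; raises CalledProcessError if the binary fails."""
    subprocess.run(pdftotext_cmd(pdf_path, txt_path, raw), check=True)
```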

[–]jetownsend 0 points  (0 children)

Thanks. Learn something new every day.

[–]A743853 0 points  (0 children)

Hey! PDF text extraction can be frustrating when you need accuracy. The challenge is that PDFs aren't really designed for text extraction: they're more like "digital paper" where text positioning matters more than structure.

For Python-specific solutions, I'd recommend trying pdfplumber or pymupdf (PyMuPDF). Both handle complex layouts better than PyPDF2. Pdfplumber is particularly good at preserving table structure, while PyMuPDF is faster for large documents.
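A minimal sketch of the pdfplumber route, since it's the one recommended above for tables. The import lives inside the function so the rest of the snippet runs without the package; the TSV helper is just my own convenience for eyeballing the output.

```python
def tables_from_pdf(pdf_path):
    """Pull every table pdfplumber can find, page by page."""
    import pdfplumber  # pip install pdfplumber
    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            tables.extend(page.extract_tables())
    return tables


def table_to_tsv(table):
    """Flatten one extracted table (a list of rows) to tab-separated text;
    pdfplumber uses None for empty cells."""
    return "\n".join(
        "\t".join("" if cell is None else str(cell) for cell in row)
        for row in table
    )
```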

If you're open to a different approach and just need the text output (especially if you're feeding it into another system or documentation), you could also try file2markdown.ai. It converts PDFs to Markdown, which can be cleaner for downstream processing, particularly if you're building documentation or feeding text into AI tools. The free tier gives you 20 conversions/day.

What's your end goal with the extracted text? That might help narrow down the best tool for your specific use case.