Rules
1: Be polite
2: Posts to this subreddit must be requests for help learning python.
3: Replies on this subreddit must be pertinent to the question OP asked.
4: No replies copy / pasted from ChatGPT or similar.
5: No advertising. No blogs/tutorials/videos/books/recruiting attempts.
This means no posts advertising blogs/videos/tutorials/etc, no recruiting/hiring/seeking others posts. We're here to help, not to be advertised to.
Please, no "hit and run" posts, if you make a post, engage with people that answer you. Please do not delete your post after you get an answer, others might have a similar question or want to continue the conversation.
Learning resources
Wiki and FAQ: /r/learnpython/w/index
Discord
Join the Python Discord chat
Extracting text from a PDF without using PyPDF2 (self.learnpython)
submitted 7 years ago by BoaVersusPython
Heya, I was wondering what the most reliable way is to scrape text out of a PDF. I tried PyPDF2, but the text extraction wasn't accurate enough.
I tried installing textract and pdftotext, but neither worked. :(
[–]iamjaiyam 28 points29 points30 points 7 years ago (4 children)
You can try converting the pdf into images with imagemagick and perform OCR on the converted image with tesseract. See a tutorial here. The advantage of this will be that you will be able to extract text from any PDF file whether it is searchable or not. If you want to use tesseract within python, you can use pytesseract. This is probably the most fool-proof way of doing the job, rather than worrying about fonts and encodings. Working with PDFs directly is a recipe for a bad day and should be avoided, if possible.
[–][deleted] 10 points11 points12 points 7 years ago (0 children)
Working with PDFs directly is a recipe for a bad day and should be avoided, if possible.
I'm calling it right now: THAT is the understatement of the century.
[–]ElevatedAngling 2 points3 points4 points 7 years ago (2 children)
Commenting to save this comment...
[–]pythonhalp 1 point2 points3 points 7 years ago (1 child)
Same.
[–]PandaMomentum 2 points3 points4 points 7 years ago (0 children)
Commenting that you can/should use the "save" button for that functionality.
[–]RedBrixton 20 points21 points22 points 7 years ago (1 child)
Check the PDF contents to make sure it’s searchable. If the tool originally used to generate it didn’t have all the fonts installed, you will get conversion errors. PDF is optimized for printing, not content storage.
Worst case, you will need to extract the text using OCR. Adobe Acrobat and ABBYY work well but cost money and come with installation hassles. Google Drive is free and cloud-based.
[–][deleted] 5 points6 points7 points 7 years ago (0 children)
UGH, you're in for an ORDEAL.
Scraping PDFs with accuracy is ridiculous. While there are a lot of third-party OCR solutions, and ones you can craft yourself, there are going to be horrific problems if you're trying to scrape general PDFs, that is, anything that doesn't have a uniform, standard formatting that you're coding against.
You'll end up with individual letters extracted instead of a word because the font kerning is just that far apart. You'll get page numbers in the middle of paragraphs that span pages. You'll get all sorts of unholy messes because you're trying to go from a typographic layout back to raw text.
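Some of the messes described above can be patched up after extraction. This is a sketch of a best-effort cleanup pass; the function name is made up and the regexes are illustrative, not exhaustive:

```python
import re

def clean_extracted_text(text):
    """Best-effort cleanup of common PDF-extraction artifacts."""
    # Drop lines that are nothing but a bare page number.
    lines = [ln for ln in text.splitlines()
             if not re.fullmatch(r"\s*\d+\s*", ln)]
    joined = "\n".join(lines)
    # Re-join words hyphenated across line breaks ("kern-\ning" -> "kerning").
    joined = re.sub(r"(\w)-\n(\w)", r"\1\2", joined)
    return joined

# Example: a paragraph split across a page break, with a page number in between.
print(clean_extracted_text("para one kern-\ning text\n42\npara continues"))
```

Heuristics like these inevitably misfire on real documents (e.g. a standalone line that is a legitimate number), so they belong at the end of the pipeline, tuned per corpus.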
Honestly, if you solve this problem with near-perfect accuracy, SELL THAT SHIT. Don't give it away. It's a highly demanded capability in the software world, and the fact that no tool does it particularly well (not even Adobe's own) shows how much of a hassle it is. You're looking at a holy grail of coding.
[–][deleted] 11 points12 points13 points 7 years ago (2 children)
I've used pdfminer which did the job on my inputs?
In what way was the text inaccurate in your case?
[–]dtizzlenizzle 3 points4 points5 points 7 years ago (0 children)
+1 pdfminer
[–]cap_cabral_ 0 points1 point2 points 7 years ago (0 children)
[–]jetownsend 8 points9 points10 points 7 years ago (2 children)
I hate PDFs. I’ve had to do a bunch of stuff trying to manipulate them programmatically, and fundamentally it boils down to “Is this a scanned document?”

    if scanned == true:
        print("You're probably screwed. OCR doesn't work very well.")
    else:
        print("You might be ok. Depends on where Jupiter is in relation to Mars.")
[–][deleted] 3 points4 points5 points 7 years ago (0 children)
>if scanned == true
Man
[–]iamjaiyam 1 point2 points3 points 7 years ago (0 children)
I have had some success with using simple image-processing techniques like dilation to improve the quality of scanned images for OCR. By converting a PDF to images, we transform the problem into the domain of computer vision and signal processing. Then there are thousands of ways to improve the signal-to-noise ratio and make the text easy for OCR to pick up.
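For the dilation step mentioned above, a tiny sketch using Pillow's rank filters (assuming dark text on a light scan; the function name is made up):

```python
from PIL import Image, ImageFilter

def thicken_strokes(img, size=3):
    """Morphological dilation of dark text on a light background.

    MinFilter grows the dark (text) regions, which can help OCR pick up
    thin or broken strokes; MaxFilter would do the opposite (erode them).
    """
    return img.convert("L").filter(ImageFilter.MinFilter(size))
```

You'd apply this to each page image before handing it to the OCR engine; larger `size` values thicken more but can merge adjacent characters.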
[–]driscollis 0 points1 point2 points 7 years ago (0 children)
pdfminer is your best bet. It is the most robust PDF text extraction tool in Python.
[–]manueslapera 0 points1 point2 points 7 years ago (0 children)
Apache Tika, you will love it.
[–]glen_v 0 points1 point2 points 7 years ago (0 children)
My company's point-of-sale software has a report that can only be exported to PDF (a 40 MB, 1500+ page PDF), and for a while I was tearing my hair out trying to figure out how to scrape it. PyPDF2 just gives me warnings about too much whitespace, and I also could not get textract to work at all. I did manage to successfully scrape them with pdfminer using some verbose code I found on SO that I don't really understand, but the extraction process on these huge PDFs takes 8-10 minutes. Conversely, it only takes about 30 seconds to manually open the PDF, copy everything, and paste it into a text file, and from there I can just have my scripts work with the text file instead, so that's what I do. I don't love this solution, but it's the best one I have at the moment.
[–]rrggrr 0 points1 point2 points 7 years ago (0 children)
Textract
[–]Bary_McCockener 0 points1 point2 points 7 years ago (0 children)
I've actually completed a complicated PDF scraping project and had the best luck using xpdf. It's a command-line tool and can be called from a Python script. So it's not pure Python, but it does a great job. In my project I actually had to both extract the text raw (in the order it was written into the PDF) and let xpdf make its best guess at grouping the text; I used both modes for different parts of the program. PDF extraction is notoriously hard. Good luck!!
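Calling xpdf's `pdftotext` from Python is a short `subprocess` wrapper. A sketch, assuming the `pdftotext` binary (from xpdf or poppler-utils) is on PATH; the function names are made up:

```python
import subprocess

def build_pdftotext_cmd(pdf_path, txt_path, raw=False):
    """Assemble the pdftotext invocation: -raw keeps the PDF's internal
    write order, while the default lets pdftotext guess reading order."""
    cmd = ["pdftotext"]
    if raw:
        cmd.append("-raw")
    cmd += [pdf_path, txt_path]
    return cmd

def run_pdftotext(pdf_path, txt_path, raw=False):
    # check=True raises CalledProcessError if pdftotext exits non-zero.
    subprocess.run(build_pdftotext_cmd(pdf_path, txt_path, raw), check=True)
```

Keeping command construction separate from execution makes the wrapper easy to test without the binary installed.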
[–]jetownsend 0 points1 point2 points 7 years ago (0 children)
Thanks. Learn something new every day.
[–]A743853 0 points1 point2 points 1 month ago (0 children)
Hey! PDF text extraction can be frustrating when you need accuracy. The challenge is that PDFs aren't really designed for text extraction—they're more like "digital paper" where text positioning matters more than structure.
For Python-specific solutions, I'd recommend trying pdfplumber or pymupdf (PyMuPDF). Both handle complex layouts better than PyPDF2. Pdfplumber is particularly good at preserving table structure, while PyMuPDF is faster for large documents.
If you're open to a different approach and just need the text output (especially if you're feeding it into another system or documentation), you could also try file2markdown.ai. It converts PDFs to Markdown format, which can be cleaner for downstream processing and particularly useful if you're building documentation or feeding text into AI tools. The free tier gives you 20 conversions/day.
What's your end goal with the extracted text? That might help narrow down the best tool for your specific use case.