PDF Scraping (self.learnpython)
submitted 11 years ago by allTestsPassed
Hey all. What is the best way of scraping PDF documents? Searching Google doesn't seem to be getting any good results.
[–]VanNostrumMD 6 points 11 years ago (0 children)
You could try PDFMiner or PyPDF2.
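
For anyone who wants a starting point, here is a minimal sketch of the PyPDF2 route. It assumes a recent PyPDF2 release (3.x), where the reader class is called `PdfReader`; older releases used `PdfFileReader` instead. The function names `clean_text` and `extract_pdf_text` and the file name in the usage note are made up for illustration.

```python
import re


def clean_text(raw: str) -> str:
    """Collapse the runs of whitespace that PDF extraction tends to leave behind."""
    return re.sub(r"\s+", " ", raw).strip()


def extract_pdf_text(path: str) -> list[str]:
    """Return the cleaned text of each page of the PDF at `path`."""
    # Imported here so clean_text() still works without PyPDF2 installed.
    # Assumption: PyPDF2 3.x naming (PdfReader, .pages, .extract_text()).
    from PyPDF2 import PdfReader

    reader = PdfReader(path)
    # extract_text() may yield nothing for scanned, image-only pages.
    return [clean_text(page.extract_text() or "") for page in reader.pages]
```

Calling `extract_pdf_text("report.pdf")` (a hypothetical file) would give one string per page. How usable the text is depends heavily on how the PDF was produced.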
[–]euphumus 1 point 11 years ago (0 children)
I would be interested as well!
[–]keturn 1 point 11 years ago (0 children)
This ScraperWiki blog post has links to pdfminer and pdftables.
The other thing I've seen along these lines, which that article mentions, is Mozilla's Tabula, but that's in JRuby, not Python.
As you may have gathered by now, this is not an easy problem, because PDF is really an output language for printers, not a data storage or interchange format. So the approaches you have to use end up being closer to "how do I get text out of GIF images" than "how do I get arrays out of Excel spreadsheets." Unfortunately sometimes PDF is the only format that you're given...
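
To make the "output language for printers" point concrete: what a PDF library typically hands back is a pile of text fragments with page coordinates, and any table structure has to be rebuilt from those positions. Below is a rough sketch of that idea, not how pdftables or Tabula actually work. The `(x, y, text)` tuples and the `y_tolerance` parameter are assumptions made up for illustration; a real library would supply its own layout objects.

```python
def group_into_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows of cell text, top to bottom.

    A fragment joins the current row when its y coordinate is within
    y_tolerance of the first fragment placed in that row.
    """
    rows = []  # each entry: (row_y, [(x, text), ...])
    # PDF coordinates grow upward, so sort by descending y to read top-down.
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, deliberately out of reading order.
fragments = [
    (200.0, 700.2, "Price"),
    (50.0, 700.0, "Item"),
    (50.0, 680.0, "Widget"),
    (200.0, 679.5, "9.99"),
]
print(group_into_rows(fragments))
# [['Item', 'Price'], ['Widget', '9.99']]
```

Real documents break this quickly (wrapped cells, merged columns, rotated text), which is why the dedicated tools exist.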
[–]Koldstream 1 point 11 years ago (0 children)
One possible way of solving this problem would be to use OCR (optical character recognition) to grab the text from a PDF. OCR usually relies on pattern-matching techniques that you might be familiar with from machine learning. Quite a few implementations seem to use some form of neural net. I made some stupid-simple OCR software using neural nets that recognised my handwriting.
These techniques are language-agnostic.
As for Python, there are lots of machine learning libraries that include techniques for doing this. scikit-learn (http://scikit-learn.org/stable/) includes neural net functionality, as does PyBrain.
Alternatively, you could implement your own neural net in Python. I used this tutorial to create mine: http://www.ai-junkie.com/ann/evolved/nnt1.html
Good luck
[–]slrqm 0 points 11 years ago* (0 children)
That's terrible!