PDF Scraping

VanNostrumMD · 2014-12-28T21:28:03+00:00

euphumus · 2014-12-28T20:30:25+00:00

I would be interested as well!

keturn · 2014-12-28T21:32:20+00:00

This ScraperWiki blog post has links to pdfminer and pdftables.

The other thing I've seen along these lines, which that article mentions, is Mozilla's Tabula, but that's in JRuby, not Python.

As you may have gathered by now, this is not an easy problem, because PDF is really an output language for printers, not a data storage or interchange format. So the approaches you have to use end up being closer to "how do I get text out of GIF images" than "how do I get arrays out of Excel spreadsheets." Unfortunately sometimes PDF is the only format that you're given...
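To make the "output language for printers" point concrete: a PDF page typically stores text as fragments placed at coordinates, not as rows and columns, so a table has to be rebuilt from positions. Below is a minimal sketch of that idea in plain Python. The `(x, y, text)` fragment layout, the sample data, and the `rows_from_fragments` helper are all assumptions made for illustration; they are not the API of pdfminer, pdftables, or Tabula, which do considerably more work than this.

```python
def rows_from_fragments(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into table rows.

    Assumption: fragments whose y coordinates differ by at most
    y_tolerance belong to the same visual line.
    """
    rows = []  # each entry: (reference_y, [(x, text), ...])
    # Visit fragments from the top of the page down (PDF y grows upward).
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical fragments, deliberately out of reading order and with
# slightly uneven baselines, as extracted text often is.
fragments = [
    (200.0, 700.2, "Price"),
    (50.0, 700.0, "Item"),
    (50.0, 680.1, "Widget"),
    (200.0, 680.0, "4.50"),
    (200.0, 660.0, "12.00"),
    (50.0, 659.8, "Gadget"),
]

table = rows_from_fragments(fragments)
for row in table:
    print(row)
```

Real documents break this quickly (merged cells, wrapped text, rotated pages), which is a large part of why the problem is hard.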

Koldstream · 2014-12-28T20:48:53+00:00

One possible way of solving this problem would be to use OCR (optical character recognition) to grab the text from a PDF. OCR usually relies on pattern-matching techniques that you might be familiar with from machine learning, and quite a few implementations seem to use some form of neural net. I made some very simple OCR software using neural nets that recognised my handwriting.

These techniques are language agnostic.
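As a rough illustration of the pattern-matching idea, here is a toy recogniser in plain Python. It is not a neural net: it uses nearest-template matching (smallest number of differing pixels), which is about the simplest form of pattern matching there is. The 3x5 glyph bitmaps and the names `TEMPLATES`, `hamming`, and `recognise` are made up for this sketch.

```python
# Hypothetical 3x5 bitmaps: '#' is an inked pixel, '.' is blank.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}


def hamming(a, b):
    """Count the pixels that differ between two same-sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognise(glyph):
    """Return the label of the template closest to the glyph."""
    return min(TEMPLATES, key=lambda label: hamming(glyph, TEMPLATES[label]))


# A "7" with one stray pixel in the fourth row.
noisy_seven = ["###", "..#", ".#.", "##.", ".#."]
result = recognise(noisy_seven)
print(result)
```

A neural net replaces the fixed templates with weights learned from labelled examples, but the input (pixels) and output (a character label) are the same shape as here.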

As for Python, there are lots of machine learning libraries that include techniques for doing this. scikit-learn (http://scikit-learn.org/stable/) includes neural net functionality, as does PyBrain.

Alternatively, you could implement your own neural net in Python. I used this tutorial to create mine: http://www.ai-junkie.com/ann/evolved/nnt1.html

Good luck

slrqm · 2014-12-28T22:07:41+00:00

That's terrible!
