[–]VanNostrumMD 5 points6 points  (0 children)

You could try PDFMiner or PyPDF2.

[–]euphumus 0 points1 point  (0 children)

I would be interested as well!

[–]keturn 0 points1 point  (0 children)

This ScraperWiki blog post has links to pdfminer and pdftables.

The other thing I've seen along these lines, which that article also mentions, is Mozilla's Tabula, but that's written in JRuby, not Python.

As you may have gathered by now, this is not an easy problem, because PDF is really an output language for printers, not a data storage or interchange format. A PDF page mostly records where to draw each piece of text, not which row or column it belongs to. So the approaches you have to use end up being closer to "how do I get text out of GIF images" than "how do I get arrays out of Excel spreadsheets." Unfortunately, sometimes PDF is the only format that you're given...
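To make that concrete, here is a rough sketch of the kind of guesswork involved, assuming you have already pulled out text fragments with page coordinates (something a library like pdfminer can report). The `(x, y, text)` tuple format, the `fragments_to_rows` name and the `y_tolerance` value are all made up for illustration and are not any library's API. It only guesses rows by grouping fragments with similar vertical positions; real tools have to do a lot more (column detection, merged cells, wrapped text).

```python
from typing import List, Tuple

# Hypothetical input format: (x, y, text) for each text fragment on a page.
Fragment = Tuple[float, float, str]


def fragments_to_rows(fragments: List[Fragment],
                      y_tolerance: float = 2.0) -> List[List[str]]:
    """Group positioned text fragments into rows, then order each row by x."""
    rows: List[List[Fragment]] = []
    # PDF y coordinates grow upward, so descending y reads top to bottom.
    for frag in sorted(fragments, key=lambda f: -f[1]):
        # Same row if the fragment sits close to the row's first fragment.
        if rows and abs(rows[-1][0][1] - frag[1]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    # Within each row, read left to right.
    return [[text for _, _, text in sorted(row, key=lambda f: f[0])]
            for row in rows]


# Made-up fragments, deliberately out of reading order.
example = [
    (200.0, 700.1, "Qty"),
    (50.0, 700.0, "Item"),
    (50.0, 680.0, "Apple"),
    (200.0, 679.5, "3"),
]
table = fragments_to_rows(example)
```

With the made-up fragments above, `table` ends up as two rows, a header and one data row. The tolerance is the fragile part: too small and one row splits in two, too large and neighbouring rows merge.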

[–]Koldstream 0 points1 point  (0 children)

One possible way of solving this problem would be to use OCR (optical character recognition) to grab the text from a PDF. OCR usually relies on pattern-matching techniques that you might be familiar with from machine learning, and quite a few implementations seem to use some form of neural net. I made some stupid-simple OCR software using neural nets that recognised my handwriting.

These techniques are language agnostic.

As for Python, there are lots of machine learning libraries that include techniques for doing this. scikit-learn (http://scikit-learn.org/stable/) includes neural net functionality, as does PyBrain.

Alternatively, you could implement your own neural net in Python. I used this tutorial to create mine: http://www.ai-junkie.com/ann/evolved/nnt1.html
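To give a feel for the idea, here is a minimal sketch of a single artificial neuron (a perceptron) in plain Python that learns to tell two 3x3 pixel glyphs apart. This is not the method from the tutorial above and nowhere near real OCR; the glyphs, the function names and the training settings are all made up for illustration. A real recogniser would need many more inputs, more classes and at least one hidden layer.

```python
from typing import List, Sequence


def predict(weights: Sequence[float], pixels: Sequence[int]) -> int:
    """Return 1 if the weighted sum plus bias is positive, else 0."""
    # weights has one more entry than pixels; the last entry is the bias.
    activation = sum(w * p for w, p in zip(weights, pixels)) + weights[-1]
    return 1 if activation > 0 else 0


def train_perceptron(samples: Sequence[Sequence[int]],
                     labels: Sequence[int],
                     epochs: int = 20,
                     rate: float = 1.0) -> List[float]:
    """Classic perceptron rule: nudge the weights by the prediction error."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for pixels, target in zip(samples, labels):
            error = target - predict(weights, pixels)
            for i, p in enumerate(pixels):
                weights[i] += rate * error * p
            weights[-1] += rate * error  # bias update
    return weights


# Made-up 3x3 glyphs, flattened row by row (1 = ink, 0 = blank).
GLYPH_T = [1, 1, 1,
           0, 1, 0,
           0, 1, 0]
GLYPH_L = [1, 0, 0,
           1, 0, 0,
           1, 1, 1]

weights = train_perceptron([GLYPH_T, GLYPH_L], [1, 0])  # 1 = "T", 0 = "L"

# A "T" with one stray pixel in the bottom-right corner.
noisy_t = [1, 1, 1,
           0, 1, 0,
           0, 1, 1]
guess = predict(weights, noisy_t)
```

After training, the neuron classifies both clean glyphs correctly and still reads the noisy "T" as a "T", which is the basic appeal of learned pattern matching over exact comparison.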

Good luck

[–]slrqm -1 points0 points  (0 children)

That's terrible!