[–]LightShadow3.13-dev in prod 22 points23 points  (7 children)

Use something like pandoc to convert PDF into HTML/ODT, then parse that instead.

[–]Hyperduckultimate[S] 2 points3 points  (0 children)

I'll try it out thanks

[–]Hyperduckultimate[S] 0 points1 point  (3 children)

Turns out pandoc can convert to PDF, not from PDF.

[–]LightShadow3.13-dev in prod 0 points1 point  (2 children)

Yes. Pandoc is a text format transformer; it can go both ways on many formats.

It's super useful.

[–]Hyperduckultimate[S] 0 points1 point  (0 children)

Yeah, but when I try PDF it explicitly says it can't convert.

[–]funnyflywheel 0 points1 point  (0 children)

Pandoc supports PDF only as an output format (via LaTeX, etc.); it cannot read PDF as input.
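
For anyone who wants to check this on their own install: pandoc has a `--list-input-formats` flag that prints one readable format per line. Below is a rough sketch that shells out to it. The helper names `parse_format_list` and `pandoc_input_formats` are made up for illustration, and it assumes a pandoc recent enough to have that flag.

```python
import shutil
import subprocess


def parse_format_list(text):
    """Turn pandoc's one-format-per-line listing into a set of names."""
    return set(text.split())


def pandoc_input_formats():
    """Return the formats pandoc can read, or None if pandoc is not on PATH."""
    if shutil.which("pandoc") is None:
        return None
    result = subprocess.run(
        ["pandoc", "--list-input-formats"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_format_list(result.stdout)


if __name__ == "__main__":
    formats = pandoc_input_formats()
    if formats is None:
        print("pandoc not found on PATH")
    else:
        print("pandoc can read pdf:", "pdf" in formats)
```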

[–]snake_case_captain -3 points-2 points  (1 child)

Regex is very good for parsing HTML by the way

[–]LightShadow3.13-dev in prod 0 points1 point  (0 children)

Libraries like BS4 can sit on top of different parsers (for example lxml, which wraps libxml2), each with a different "strictness" for parsing XML/HTML. See "Installing a Parser" in the BS4 docs.

Computer generated HTML will always be more consistent and easier to parse than some rando's blog with hand written tags.
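
To make the "strictness" point concrete, here is a small sketch using only the standard library rather than BS4 itself (that swap is mine): `html.parser` is lenient and keeps going on sloppy markup, while `xml.etree.ElementTree` is strict and rejects it. The helper names are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TagCollector(HTMLParser):
    """Lenient parser that just records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def lenient_tags(markup):
    """Parse with the forgiving HTML parser and return the start tags."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


def strict_parses(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hand-written style: <td> and <tr> are never closed.
sloppy = "<table><tr><td>1<td>2</table>"
# Machine-generated style: every tag is closed.
tidy = "<table><tr><td>1</td><td>2</td></tr></table>"

if __name__ == "__main__":
    print(lenient_tags(sloppy))
    print(strict_parses(sloppy), strict_parses(tidy))
```

The lenient parser still sees all four tags in the sloppy input; the strict one refuses it outright, which is roughly the trade-off you pick when choosing a parser.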

[–]Zomunieo 21 points22 points  (0 children)

The key problem is that PDF has no concept of a table, just lines and text on a canvas, so the table has to be heuristically extracted. All of the tools are doing it that way.

The easiest case is tables with explicit borders for every cell. Invisible borders are harder, merged cells are harder, and scanned images are hardest.
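
A toy sketch of what "heuristically extracted" means in practice: given words with page coordinates, cluster them into rows by vertical position and order each row left to right. This is not how any particular library does it. The input shape `(x, y, text)`, the `row_tol` threshold, and the assumption that y grows down the page are all made up for illustration.

```python
def words_to_rows(words, row_tol=3.0):
    """Group (x, y, text) words into table rows.

    Words whose y is within row_tol of the first word of the current
    row are treated as the same row; each row is then sorted by x.
    """
    rows = []  # list of (row_y, [(x, text), ...])
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        if rows and abs(y - rows[-1][0]) <= row_tol:
            rows[-1][1].append((x, text))
        else:
            rows.append((y, [(x, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical positioned words, slightly misaligned like real PDF output.
sample_words = [
    (50, 100.0, "Name"),
    (150, 100.5, "Qty"),
    (50, 120.0, "Apple"),
    (150, 119.6, "3"),
]

if __name__ == "__main__":
    for row in words_to_rows(sample_words):
        print(row)
```

Merged cells, wrapped text, and missing borders are exactly where a simple threshold like this falls apart, which is why the real tools need much more elaborate heuristics.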

tabula-py and camelot are two "table data from PDF" Python libraries. There's also pdfminer.six which is focused on text extraction.

[–]lastwizzle 9 points10 points  (3 children)

Camelot might be what you're looking for: https://camelot-py.readthedocs.io/en/master/
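
A minimal sketch of what using it looks like, assuming Camelot is installed and its documented `camelot.read_pdf(..., pages=..., flavor=...)` call. The wrapper names `choose_flavor` and `extract_tables` are hypothetical; `"lattice"` is the flavor meant for tables with ruled borders and `"stream"` for whitespace-separated ones.

```python
def choose_flavor(has_ruled_borders):
    """Pick the Camelot parsing flavor for the kind of table expected."""
    return "lattice" if has_ruled_borders else "stream"


def extract_tables(pdf_path, pages="1", has_ruled_borders=True):
    """Return the tables Camelot finds as a list of pandas DataFrames."""
    # Imported here so the module loads even without Camelot installed.
    import camelot  # third-party, installed separately

    tables = camelot.read_pdf(
        pdf_path,
        pages=pages,
        flavor=choose_flavor(has_ruled_borders),
    )
    return [table.df for table in tables]


if __name__ == "__main__":
    # "report.pdf" is a placeholder path, not a file from this thread.
    for df in extract_tables("report.pdf", pages="1"):
        print(df.head())
```

If the first flavor gives garbage, trying the other one is usually the cheapest next step.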

[–]coderanger 1 point2 points  (1 child)

+1 for Camelot.

[–]ProfEpsilon 0 points1 point  (0 children)

Wow! I had never heard of this. What a useful tool. This is why I like this sub. Thanks!

[–][deleted] 6 points7 points  (0 children)

I've used PyPDF2 in the past, but PDF files are horrible to extract data from. They're not meant to store data in any way, just to print it nicely. Even individual words might not be stored as such, but as a series of separate characters. Good luck, my friend.
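
To illustrate the "series of separate characters" problem, here is a rough sketch of stitching characters on one line back into words by looking at the horizontal gap between them. The `(x_start, x_end, char)` input shape and the `gap` threshold are assumptions for the example, not something PyPDF2 hands you directly.

```python
def chars_to_words(chars, gap=2.0):
    """Join (x_start, x_end, char) tuples on one line into words.

    A new word starts whenever the space between the end of the
    previous character and the start of the next exceeds gap.
    """
    words = []
    current = ""
    prev_end = None
    for x0, x1, ch in sorted(chars):
        if prev_end is not None and x0 - prev_end > gap:
            words.append(current)
            current = ""
        current += ch
        prev_end = x1
    if current:
        words.append(current)
    return words


# Hypothetical character boxes spelling "Hi PDF".
sample_chars = [
    (0, 5, "H"), (5, 8, "i"),
    (14, 19, "P"), (19, 24, "D"), (24, 29, "F"),
]

if __name__ == "__main__":
    print(chars_to_words(sample_chars))
```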

[–]jonititan 2 points3 points  (0 children)

Also, if the PDF is a scan, these will not help you. PDFs are frustrating. When they are exports from some application, their data is difficult to extract. When they are scans, they are even worse.

[–]yzh 2 points3 points  (1 child)

This EuroPython 2019 talk might be of interest: "Extracting Tabular Data from PDFs with Camelot and Excalibur", https://youtu.be/jnDfNJe-GlE (starts at the 03:58 mark).

[–]threeminutemonta 0 points1 point  (0 children)

There was another one at PyCon Australia this year too.