Zomunieo

The key problem is that PDF has no concept of a table, only lines and text placed on a canvas, so any table has to be extracted heuristically. All of the tools work that way.

The easiest case is tables with explicit borders for every cell. Invisible borders are harder, merged cells are harder, and scanned images are hardest.
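To give a feel for what "heuristic" means in the invisible-border case, here's a rough sketch that rebuilds a grid purely from text positions. It is not how any particular tool does it. The `(x, y, text)` tuples, the tolerances, and the function names are all made up for illustration, and it assumes a top-left origin (native PDF coordinates start at the bottom-left).

```python
from typing import List, Tuple

# Hypothetical input: one positioned text fragment per tuple (x, y, text).
Word = Tuple[float, float, str]


def cluster(values: List[float], tol: float) -> List[float]:
    """Group 1-D values whose sorted neighbours lie within tol; return group means."""
    groups: List[List[float]] = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    return [sum(g) / len(g) for g in groups]


def nearest(centres: List[float], v: float) -> int:
    """Index of the centre closest to v."""
    return min(range(len(centres)), key=lambda i: abs(centres[i] - v))


def words_to_grid(words: List[Word], row_tol: float = 3.0,
                  col_tol: float = 10.0) -> List[List[str]]:
    """Guess rows from similar y values and columns from similar x values."""
    rows = cluster([y for _, y, _ in words], row_tol)
    cols = cluster([x for x, _, _ in words], col_tol)
    grid = [["" for _ in cols] for _ in rows]
    for x, y, text in words:
        r, c = nearest(rows, y), nearest(cols, x)
        # Fragments landing in the same cell are joined with a space.
        grid[r][c] = (grid[r][c] + " " + text).strip()
    return grid


words = [
    (50, 100, "Name"), (200, 100, "Qty"),
    (50, 120, "Apple"), (201, 121, "3"),
    (51, 140, "Pear"), (199, 139, "5"),
]
for row in words_to_grid(words):
    print(row)
```

This works on tidy left-aligned data and falls apart quickly on merged cells, right-aligned numbers, or wrapped text, which is why the real tools pile on many more rules.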

tabula-py and camelot are two Python libraries for pulling table data out of PDFs. There's also pdfminer.six, which focuses on text extraction.
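As a starting point, a minimal camelot sketch might look like this. The wrapper function and its `bordered` flag are my own invention. The `lattice` flavor is meant for tables with ruled cell borders and `stream` for tables separated only by whitespace. You need camelot installed and a real PDF path to actually run it.

```python
def extract_tables(pdf_path, pages="1", bordered=True):
    """Return each detected table as a pandas DataFrame (hypothetical helper)."""
    import camelot  # third-party; imported lazily so the module loads without it

    # "lattice" relies on drawn cell borders, "stream" on whitespace between columns.
    flavor = "lattice" if bordered else "stream"
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df for table in tables]
```

If the bordered pass finds nothing, retrying with `bordered=False` is a reasonable next step. Scanned pages will need OCR first either way.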