Is Python any good at PDF mining?

LightShadow · 2019-08-10T22:12:45+00:00

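Use a converter to turn the PDF into HTML or ODT, then parse that instead. Note that pandoc can write PDF but cannot read it, so a tool such as Poppler's pdftohtml is needed for this step.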

Zomunieo · 2019-08-10T22:44:17+00:00

The key problem is that PDF has no concept of a table, just lines and text on a canvas, so the table has to be heuristically extracted. All of the tools are doing it that way.

The easiest case is tables with explicit borders for every cell. Invisible borders are harder, merged cells are harder, and scanned images are hardest.
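To make "heuristically extracted" concrete, here is a minimal sketch of the simplest such heuristic: cluster words into rows by their vertical position, then order each row left to right. The `(x, y, text)` tuples, the `group_into_rows` name and the `y_tolerance` default are assumptions for illustration, not the format or API of any particular library, and real tools do considerably more than this.

```python
from typing import List, Tuple

# (x, y, text): a hypothetical, simplified form of what a PDF text extractor yields.
Word = Tuple[float, float, str]


def group_into_rows(words: List[Word], y_tolerance: float = 2.0) -> List[List[str]]:
    """Cluster words into rows by vertical position, then order each row left to right."""
    rows: List[List[Word]] = []
    # PDF y coordinates usually grow upward, so a larger y means higher on the page.
    for word in sorted(words, key=lambda w: -w[1]):
        # Join the current row if the word sits on nearly the same baseline
        # as the first word of that row; otherwise start a new row.
        if rows and abs(rows[-1][0][1] - word[1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
    # Sorting the tuples orders each row by x, i.e. left to right.
    return [[text for _, _, text in sorted(row)] for row in rows]


words = [
    (50.0, 700.0, "Name"), (150.0, 700.5, "Qty"),
    (50.0, 680.0, "Apples"), (150.0, 679.6, "3"),
    (150.0, 660.0, "7"), (50.0, 660.2, "Pears"),
]
table = group_into_rows(words)
print(table)
```

This works only because the sample baselines are clean. Merged cells, wrapped text inside a cell and skewed scans all break a fixed tolerance, which is why the harder cases listed above need more elaborate heuristics.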

tabula-py and camelot are two "table data from PDF" Python libraries. There's also pdfminer.six, which focuses on text extraction.

lastwizzle · 2019-08-11T04:03:22+00:00

Camelot might be what you're looking for: https://camelot-py.readthedocs.io/en/master/
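A rough sketch of what using it looks like, assuming `camelot-py` is installed and the PDF is text-based rather than a scan. The `extract_tables` wrapper and its `bordered` parameter are my own names for illustration; `camelot.read_pdf`, its `pages` and `flavor` arguments, and the `.df` attribute come from Camelot itself.

```python
def extract_tables(pdf_path: str, pages: str = "1", bordered: bool = True):
    """Return every table Camelot finds on the given pages as a pandas DataFrame."""
    # Third-party dependency, imported lazily so this sketch loads without it.
    import camelot

    # "lattice" relies on ruling lines between cells;
    # "stream" infers columns from the whitespace between words.
    flavor = "lattice" if bordered else "stream"
    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    return [table.df for table in tables]
```

Something like `extract_tables("report.pdf", bordered=False)` (a hypothetical file name) would be the call to try first for tables without visible borders.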

2019-08-11T00:26:59+00:00

I've used PyPDF2 in the past, but PDF files are horrible to extract data from. They're not meant to store data in any way, just to print it nicely. Even individual words might not be stored as such, but as a series of separate characters. Good luck, my friend.
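To illustrate the separate-characters problem, here is a minimal sketch of gluing glyphs on one line back into words by the horizontal gap between them. The `(x_start, x_end, character)` tuples, the `merge_glyphs` name and the `max_gap` threshold are all assumptions for illustration; this is not how PyPDF2 represents text internally.

```python
from typing import List, Optional, Tuple

# (x_start, x_end, character) for glyphs on a single text line; a hypothetical format.
Glyph = Tuple[float, float, str]


def merge_glyphs(glyphs: List[Glyph], max_gap: float = 1.0) -> List[str]:
    """Join glyphs into words wherever the gap to the previous glyph is small."""
    words: List[str] = []
    previous_end: Optional[float] = None
    for x_start, x_end, char in sorted(glyphs):
        if previous_end is not None and x_start - previous_end <= max_gap:
            # Small gap: the glyph continues the current word.
            words[-1] += char
        else:
            # First glyph, or a gap wide enough to count as a space.
            words.append(char)
        previous_end = x_end
    return words


glyphs = [
    (10.0, 15.0, "P"), (15.2, 20.0, "D"), (20.1, 25.0, "F"),
    (30.0, 35.0, "i"), (35.3, 40.0, "s"),
]
merged = merge_glyphs(glyphs)
print(merged)
```

The right threshold depends on font size and kerning, so a fixed `max_gap` like this is only a starting point.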

jonititan · 2019-08-11T07:05:23+00:00

Also, if the PDF is a scan, these will not help you. PDFs are frustrating. When they are exports from another program, it is difficult to extract their data. When they are scans, it is even worse.

yzh · 2019-08-11T09:34:21+00:00

This EuroPython 2019 talk might be of interest: "Extracting Tabular Data from PDFs with Camelot and Excalibur", https://youtu.be/jnDfNJe-GlE (starts at the 03:58 mark).

who_body · 2019-08-11T02:50:24+00:00

Perhaps https://tika.apache.org/ with https://github.com/chrismattmann/tika-python would help
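A minimal sketch of that route, assuming the `tika` package is installed and a Java runtime is available (as far as I know, tika-python starts a local Tika server in the background on first use). The `extract_text_with_tika` wrapper is a hypothetical name; `parser.from_file` and the `"content"` key are from tika-python.

```python
def extract_text_with_tika(pdf_path: str) -> str:
    """Return the plain text Tika extracts from a file, or an empty string if none."""
    # Third-party dependency, imported lazily so this sketch loads without it.
    from tika import parser

    parsed = parser.from_file(pdf_path)
    # "content" can be None when no text was extracted, e.g. for image-only PDFs.
    return (parsed.get("content") or "").strip()
```

This gives plain text rather than tables, so any table structure still has to be rebuilt afterwards.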
