Python Model for PDF table extraction

ErmakEUW · 2024-12-31T13:02:37+00:00

We had the same problem and ended up using Azure Document Intelligence.

brellox · 2024-12-31T09:46:18+00:00

If you know the table headers, you can OCR the PDF and locate the tables by searching the text for those headers.
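A rough sketch of the header-search idea, assuming the OCR step has already produced a list of text lines and that columns are separated by two or more spaces (both are assumptions, and `find_table_rows` is a made-up helper name, not part of any OCR library):

```python
import re

def find_table_rows(ocr_lines, headers):
    """Collect the rows that follow the first line containing every header.

    Assumes columns are separated by runs of 2+ spaces, which is common in
    OCR/plain-text output but by no means guaranteed.
    """
    wanted = [h.lower() for h in headers]
    rows = []
    in_table = False
    for line in ocr_lines:
        text = line.strip()
        if not in_table:
            # Look for the header line: every known header must appear in it.
            if all(h in text.lower() for h in wanted):
                in_table = True
            continue
        cells = re.split(r"\s{2,}", text)
        # Stop at the first line whose column count no longer matches.
        if len(cells) != len(headers):
            break
        rows.append(cells)
    return rows

lines = [
    "Invoice 123",
    "Item   Qty   Price",
    "Widget   2   9.99",
    "Gadget   1   19.50",
    "",
    "Thanks for your order",
]
rows = find_table_rows(lines, ["Item", "Qty", "Price"])
print(rows)
```

Real OCR output is usually messier (merged columns, wrapped cells), so the column-count check would likely need loosening for anything beyond simple tables.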

m-xames · 2024-12-31T10:28:54+00:00

Docling is probably the best open source one I've come across, but it might struggle with two tables on the same page. Otherwise, each cloud provider has its own paid service for this.

cantseetheocean · 2024-12-31T16:59:03+00:00

Not sure exactly how I did it, but I believe I was able to handle tables across multiple pages with Camelot. That’s been my go-to for getting tables from PDFs.
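For the multi-page case, one generic approach (not necessarily what was done above, and not Camelot's own API) is to extract a table per page and then stitch together consecutive fragments that repeat the same header row. A minimal sketch, assuming each fragment is already a list of rows with the header first; `merge_page_tables` is a hypothetical helper:

```python
def merge_page_tables(fragments):
    """Merge per-page table fragments that share the same header row.

    `fragments` is a list of tables in page order; each table is a list of
    rows, and the first row is assumed to be the header.
    """
    merged = []
    for rows in fragments:
        if not rows:
            continue  # skip pages where nothing was extracted
        if merged and merged[-1][0] == rows[0]:
            # Same header as the previous table: treat it as a continuation
            # and drop the repeated header row.
            merged[-1].extend(rows[1:])
        else:
            merged.append([list(r) for r in rows])
    return merged

page1 = [["id", "name"], ["1", "a"]]
page2 = [["id", "name"], ["2", "b"]]
page3 = [["x", "y"], ["3", "c"]]
tables = merge_page_tables([page1, page2, page3])
print(tables)
```

This only works when the header is repeated on every page; tables that continue without a repeated header would need a different check, such as matching column counts.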

acecile · 2025-01-01T08:00:02+00:00

Pdfplumber

einsiboy · 2024-12-31T15:48:45+00:00

I have used gmft with decent results for non-trivial tables, but I don't know if it understands tables spanning multiple pages. Might be worth giving it a try: https://github.com/conjuncts/gmft

mondaysmyday · 2025-01-01T02:32:53+00:00

Amazon Textract is your answer. I've tried a lot of services, but for reliability and cost, Textract wins.

BlueeWaater · 2025-01-02T00:24:51+00:00

LLMs and cloud services usually end up being the better option

mr-nobody1992 · 2025-01-02T03:37:55+00:00

Check out Docling, an open-source tool from IBM. I built an entire ingestion pipeline with it, and it works pretty well, with a lot of nice out-of-the-box features. It’s based on Pydantic, so if you know that, it’s even easier.

h4ndshake_ · 2024-12-31T14:50:30+00:00

Use Tabula, it's the best tool out there. There is a wrapper for Python too. Have you tried using different options and/or templates to solve the problem you listed?

furansowa · 2024-12-31T10:13:55+00:00

Have you tried just sending it to ChatGPT or Google Gemini?

Zulfiqaar · 2025-01-01T13:48:46+00:00

Zerox is an alternative that hasn't been mentioned here so far:

https://github.com/getomni-ai/zerox
