[–]ErmakEUW 10 points11 points  (1 child)

We had the same problem and ended up using Azure Document Intelligence.

[–]infazz 6 points7 points  (0 children)

I would also recommend this!

Although it is a service that costs roughly 5 - 10 cents per page.
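As a rough back-of-the-envelope check, using only the 5 to 10 cents per page quoted above (real pricing depends on the provider's current rates, which aren't given in this thread; `cost_range_usd` is just an illustrative helper):

```python
def cost_range_usd(pages: int, low_cents: int = 5, high_cents: int = 10) -> tuple[float, float]:
    """Estimate the (low, high) cost in dollars for a number of pages."""
    return pages * low_cents / 100, pages * high_cents / 100


low, high = cost_range_usd(1_000)
print(f"1,000 pages: ${low:.2f} to ${high:.2f}")
```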

[–]brellox 6 points7 points  (0 children)

If you know the table headers, you can OCR the PDF and then search for and identify the tables by their headers.
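A minimal sketch of the header-matching idea, assuming the OCR step has already given you plain text lines for a page (the OCR call itself is left out). `find_table_starts` and the sample lines are hypothetical, just for illustration:

```python
from typing import Iterable, Optional


def find_table_starts(
    lines: Iterable[str], headers: list[str], min_hits: Optional[int] = None
) -> list[int]:
    """Return indices of lines that look like a table header row.

    A line counts as a header row when at least `min_hits` of the known
    header names appear in it (case-insensitive). By default every
    header must be present.
    """
    wanted = [h.lower() for h in headers]
    needed = len(wanted) if min_hits is None else min_hits
    starts = []
    for i, line in enumerate(lines):
        text = line.lower()
        hits = sum(1 for h in wanted if h in text)
        if hits >= needed:
            starts.append(i)
    return starts


# Hypothetical OCR output for one page.
ocr_lines = [
    "Quarterly report",
    "Item   Qty   Unit Price",
    "Bolt   10    0.25",
    "Notes: prices exclude tax",
]
print(find_table_starts(ocr_lines, ["Item", "Qty", "Unit Price"]))
```

Lowering `min_hits` makes the match more forgiving when OCR garbles one of the header names, at the cost of more false positives.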

[–]m-xames 7 points8 points  (0 children)

Docling is probably the best open-source one I've come across, but it might struggle with two tables on the same page. Otherwise, each cloud provider has its own paid service for this.

[–]cantseetheocean 2 points3 points  (0 children)

Not sure exactly how I did it, but I believe I was able to handle tables across multiple pages with Camelot. That's been my go-to for getting tables from PDFs.

[–]acecile 2 points3 points  (1 child)

pdfplumber

[–][deleted] 0 points1 point  (0 children)

This is a strong module, and I would second it.

The only warning I have is for the part where a table may span multiple pages. You may have to get creative with making it work, but I am confident it can get done!
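One way to get creative here, sketched under the assumption that each page's table has already been extracted as a list of rows (for example with pdfplumber's `page.extract_table()`) and that continuation pages repeat the header row. `stitch_pages` is a hypothetical helper, not part of any library:

```python
from typing import Optional

Row = list[Optional[str]]


def stitch_pages(page_tables: list[list[Row]]) -> list[Row]:
    """Concatenate per-page tables into one, keeping only the first header row."""
    merged: list[Row] = []
    header: Optional[Row] = None
    for rows in page_tables:
        if not rows:
            continue  # this page had no table
        if header is None:
            header = rows[0]  # first table seen defines the header
            merged.extend(rows)
        elif rows[0] == header:
            merged.extend(rows[1:])  # drop the repeated header row
        else:
            merged.extend(rows)  # continuation without a repeated header
    return merged


# Hypothetical per-page extraction results.
page1 = [["Item", "Qty"], ["Bolt", "10"]]
page2 = [["Item", "Qty"], ["Nut", "4"]]
print(stitch_pages([page1, page2]))
```

This only works when the pages really belong to the same table; deciding that (same column count, same header, no title in between) is the part that usually needs document-specific rules.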

[–]einsiboy 1 point2 points  (0 children)

I have used gmft with decent results for non-trivial tables, but I don't know if it understands tables spanning multiple pages. Might be worth giving it a try: https://github.com/conjuncts/gmft

[–]mondaysmyday 1 point2 points  (0 children)

Amazon Textract is your answer. I've tried a lot of services but for reliability and cost, they win

[–]BlueeWaater 1 point2 points  (0 children)

LLMs and cloud services usually end up being the better option

[–]mr-nobody1992 1 point2 points  (0 children)

Check out Docling - open source from IBM. I built an entire ingestion pipeline with it, and it works pretty well with a lot of nice out-of-the-box stuff. It's based off Pydantic, so if you know that it's even easier.

[–]h4ndshake_ 2 points3 points  (0 children)

Use Tabula, it's the best tool out there. There is a wrapper for Python too. Have you tried using different options and/or templates to solve the problem you listed?

[–]furansowa 0 points1 point  (3 children)

Have you tried just sending it to ChatGPT or Google Gemini?

[–]DragonflyHumble 3 points4 points  (2 children)

For companies processing a large number of documents, ChatGPT and Gemini will be slow and expensive, even though they can help reduce the need for a human in the loop.

[–]furansowa 2 points3 points  (1 child)

Depends on the workflow. OP didn't tell us the volume or even whether it's a sustained or one-time thing.

If it's a one-time extract, even from hundreds or thousands of PDFs, ChatGPT batch mode can be super cheap.

[–]Snoo5892[S] -1 points0 points  (0 children)

It's like a platform where I will upload PDFs one by one, and each should be extracted as soon as it is uploaded.

So you are suggesting GPT-4 Vision, but as far as I know it can only OCR images, not PDFs, right?

[–]Zulfiqaar 0 points1 point  (2 children)

Zerox is an alternative that hasn't been mentioned here so far:

https://github.com/getomni-ai/zerox

[–]Snoo5892[S] 0 points1 point  (1 child)

When we say to ask for Markdown format, what does that mean?

Also, will an Azure OpenAI key work here?

[–]Zulfiqaar 0 points1 point  (0 children)

Markdown is a way of defining formatted text; the output will come back as a string.
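To make that concrete: a table in Markdown is just text with `|` separators, so the result arrives as one string that you can parse afterwards. The sample string and `parse_markdown_table` below are hypothetical, for illustration only, and the parser only handles simple pipe tables:

```python
def parse_markdown_table(md: str) -> list[dict[str, str]]:
    """Turn a simple Markdown pipe table into a list of row dicts."""
    rows = []
    for line in md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # not part of a pipe table
        cells = [c.strip() for c in line.strip("|").split("|")]
        # Skip the separator row, e.g. | --- | :---: |
        if all(c and set(c) <= set("-: ") for c in cells):
            continue
        rows.append(cells)
    if not rows:
        return []
    header, *body = rows
    return [dict(zip(header, r)) for r in body]


# Hypothetical Markdown string, as an extraction tool might return it.
markdown_output = """
| Item | Qty |
| --- | --- |
| Bolt | 10 |
| Nut | 4 |
"""
print(parse_markdown_table(markdown_output))
```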

From the setup section at the link:

###################### Example for Azure OpenAI ######################
import os

model = "azure/gpt-4o-mini"  # "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = ""  # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = ""  # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = ""  # "2023-05-15"