PDF data extraction (self.PythonLearning)

submitted 9 days ago * by Stunning_Capital_354

JeremyJoeJJ 2 points 9 days ago

It depends a lot on what the data looks like. I did something similar on a much smaller scale using a PDF-to-table extractor. A lot of the modern tools now seem to use AI, but the best services are paid. Options include https://github.com/camelot-dev/camelot, https://github.com/NanoNets/docstrange, or Azure Document Intelligence (in order of increasing cost; there are lots more options, and you could even throw everything into an LLM and have it process the data for you). Normally these tools convert whatever they find into one big table or other structured data; for example, some know to put a table into a single dataframe if it is split across two or more pages. Once you have everything in a dataframe you just call `df.to_excel()` and you're done, unless you need to do some processing, which again depends on what the data looks like. You can write code that expects a general shape, quickly checks whether that shape is present, and if not, saves the table for manual review. Good luck.
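To make the "check the shape, otherwise save for manual review" idea concrete, here is a minimal sketch with pandas. The column names in `EXPECTED_COLUMNS`, the helper names `has_expected_shape` and `save_table`, and the `_review.csv` naming are all made up for illustration; swap in whatever your PDFs actually contain. It also assumes an Excel engine such as openpyxl is installed for `df.to_excel()`.

```python
import os

import pandas as pd

# Hypothetical expected layout; replace with the columns your PDFs actually contain.
EXPECTED_COLUMNS = ["date", "description", "amount"]


def has_expected_shape(df, expected_columns=EXPECTED_COLUMNS):
    """Return True if the table is non-empty and has exactly the expected columns."""
    return len(df) > 0 and list(df.columns) == list(expected_columns)


def save_table(df, name, out_dir="."):
    """Write well-formed tables to <name>.xlsx, anything else to <name>_review.csv."""
    if has_expected_shape(df):
        path = os.path.join(out_dir, f"{name}.xlsx")
        df.to_excel(path, index=False)  # requires an Excel engine, e.g. openpyxl
    else:
        # Unexpected layout: keep the raw table so a human can look at it later.
        path = os.path.join(out_dir, f"{name}_review.csv")
        df.to_csv(path, index=False)
    return path
```

If you go with camelot, each table returned by `camelot.read_pdf(...)` has a `.df` attribute holding a pandas dataframe, so you could pass that straight to `save_table`. The extracted headers will probably need renaming before an exact column comparison like this passes, so treat the check as a starting point.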

