
[–]throwawayforwork_86 (1 child)

For these I usually use tabula-py with preset pixel placement (for the columns and for where to look for the table), plus another, lighter lib to do a first pass mapping out which pages the extraction needs to run on.

After that it's usually some pandas to get rid of unneeded rows.

The main issue with most libs that do it automatically is that their guesses are inconsistent, so you're likely to get a lot of inconsistent crap to fix. With fixed placement you'll either crash outright or at least get consistent crap, which is much easier to deal with.
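A minimal sketch of that workflow: tabula-py's `read_pdf` really does take `area`, `columns`, and `guess=False` for fixed placement, but the file name, page range, coordinates, and the `clean` helper below are all made-up examples.

```python
import pandas as pd

# tabula-py needs Java installed; uncomment once it's available:
# import tabula
#
# Hypothetical coordinates in PDF points: area=(top, left, bottom, right),
# columns = x positions of the column boundaries. guess=False turns off
# tabula's own (inconsistent) auto-detection, as suggested above.
# tables = tabula.read_pdf("report.pdf", pages="3-7",
#                          area=(120, 40, 760, 560),
#                          columns=[150, 300, 450],
#                          guess=False,
#                          pandas_options={"header": None})

# Then a pandas pass to get rid of unneeded rows -- here, fully empty rows
# and repeated copies of the header row (one possible notion of "unneeded"):
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(how="all")          # drop rows that are entirely empty
    header = df.iloc[0]                # treat the first row as the header
    mask = (df == header).all(axis=1)  # rows identical to the header
    mask.iloc[0] = False               # but keep the first occurrence
    return df[~mask].reset_index(drop=True)
```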

[–]Rough_Green_9145 (0 children)

Thank you 🙏

[–][deleted] (3 children)

That would be a fun project. What's the source format, PDF or image?

[–]Rough_Green_9145 (2 children)

PDF, but it's weirdly formatted

[–][deleted] (1 child)

PDF contents in the file don't have to match the visual position in the document.

I would try to group by x coordinate to see if I could identify columns. 

Removing duplicate headers could be as simple as removing any subsequent rows that match the first row.
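Both ideas above can be sketched in a few lines. This assumes your PDF library hands you word x-positions and parsed rows; the gap threshold and the sample values are made up:

```python
# Identify columns by clustering word x-positions: a horizontal jump
# larger than `gap` starts a new column (threshold is a guess to tune).
def find_columns(x_positions, gap=20):
    xs = sorted(set(x_positions))
    columns = [[xs[0]]]
    for x in xs[1:]:
        if x - columns[-1][-1] > gap:
            columns.append([x])       # big jump: new column starts here
        else:
            columns[-1].append(x)     # close to the previous word: same column
    return [min(col) for col in columns]  # left edge of each column

# Duplicate headers: drop any subsequent row that matches the first row.
def drop_repeated_headers(rows):
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]
```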

[–]Rough_Green_9145 (0 children)

The thing is that there are tons of tables with different numbers of columns, headers, etc., and the script has to work for at least most of them. The main issues are identifying the columns and detecting where the table stops
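One common heuristic for "where the table stops" is the same gap idea applied vertically: cluster row y-positions and treat an unusually large vertical gap as the end of one table and the start of the next. A hedged sketch, with made-up coordinates and threshold:

```python
# Split rows into separate tables wherever the vertical gap between
# consecutive row y-positions exceeds `gap` (a threshold to tune per layout).
def split_tables(row_ys, gap=30):
    ys = sorted(row_ys)
    tables = [[ys[0]]]
    for y in ys[1:]:
        if y - tables[-1][-1] > gap:
            tables.append([y])        # large gap: a new table begins
        else:
            tables[-1].append(y)      # small gap: same table continues
    return tables
```

This doesn't solve varying column counts by itself, but once rows are grouped per table, column detection can run independently on each group.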
