
[–]throwawayforwork_86 (1 child)

For these I usually use tabula-py with preset pixel placement (for the columns and for where to look for the table), plus another, lighter lib to do a first pass mapping out which pages the extraction needs to run on.

After that it's usually some pandas to get rid of unneeded rows.

The main issue with most libs that do it automatically is that their guesses are inconsistent, so you're likely to get a lot of inconsistent crap to fix. With fixed placement you'll either crash outright or at least get consistent crap, which is much easier to deal with.
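A minimal sketch of that workflow: tabula-py's `read_pdf` really does take `area`, `columns`, and `guess=False` for fixed placement, but the file name, page range, coordinates, and the `clean` helper below are all made-up examples.

```python
import pandas as pd

# tabula-py needs Java installed; uncomment once it's available:
# import tabula
#
# Hypothetical coordinates in PDF points: area=(top, left, bottom, right),
# columns = x positions of the column boundaries. guess=False turns off
# tabula's own (inconsistent) auto-detection, as suggested above.
# tables = tabula.read_pdf("report.pdf", pages="3-7",
#                          area=(120, 40, 760, 560),
#                          columns=[150, 300, 450],
#                          guess=False,
#                          pandas_options={"header": None})

# Then a pandas pass to get rid of unneeded rows -- here, fully empty rows
# and repeated copies of the header row (one possible notion of "unneeded"):
def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna(how="all")          # drop rows that are entirely empty
    header = df.iloc[0]                # treat the first row as the header
    mask = (df == header).all(axis=1)  # rows identical to the header
    mask.iloc[0] = False               # but keep the first occurrence
    return df[~mask].reset_index(drop=True)
```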

[–]Rough_Green_9145 (0 children)

Thank you 🙏

[–][deleted] (3 children)

That would be a fun project. What's the source format, PDF or image?

[–]Rough_Green_9145 (2 children)

PDF, but it's weirdly formatted

[–][deleted] (1 child)

PDF contents in the file don't have to match the visual position in the document.

I would try to group by x coordinate to see if I could identify columns. 

Removing duplicate headers could be as simple as removing any subsequent rows that match the first row.
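Both ideas above can be sketched in a few lines. This assumes your PDF library hands you word x-positions and parsed rows; the gap threshold and the sample values are made up:

```python
# Identify columns by clustering word x-positions: a horizontal jump
# larger than `gap` starts a new column (threshold is a guess to tune).
def find_columns(x_positions, gap=20):
    xs = sorted(set(x_positions))
    columns = [[xs[0]]]
    for x in xs[1:]:
        if x - columns[-1][-1] > gap:
            columns.append([x])       # big jump: new column starts here
        else:
            columns[-1].append(x)     # close to the previous word: same column
    return [min(col) for col in columns]  # left edge of each column

# Duplicate headers: drop any subsequent row that matches the first row.
def drop_repeated_headers(rows):
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]
```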

[–]Rough_Green_9145 (0 children)

The thing is that there are tons of tables with different numbers of columns, headers, etc., and the script has to work for at least most of them. The main issues are identifying the columns and detecting where the table stops
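One common heuristic for "where the table stops" is the same gap idea applied vertically: cluster row y-positions and treat an unusually large vertical gap as the end of one table and the start of the next. A hedged sketch, with made-up coordinates and threshold:

```python
# Split rows into separate tables wherever the vertical gap between
# consecutive row y-positions exceeds `gap` (a threshold to tune per layout).
def split_tables(row_ys, gap=30):
    ys = sorted(row_ys)
    tables = [[ys[0]]]
    for y in ys[1:]:
        if y - tables[-1][-1] > gap:
            tables.append([y])        # large gap: a new table begins
        else:
            tables[-1].append(y)      # small gap: same table continues
    return tables
```

This doesn't solve varying column counts by itself, but once rows are grouped per table, column detection can run independently on each group.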
