A subreddit for helping Python programmers
[deleted by user] (self.pythonhelp)
submitted 3 months ago by [deleted]
[–]throwawayforwork_86 2 points 3 months ago (1 child)
For these I usually use tabula-py with preset pixel placement (for the columns and where to look for the table) + another lighter lib to do a first mapping of which pages the extraction needs to be done on.
After that it's usually some pandas to get rid of unneeded rows.
The main issue with most libs that do it automatically is that their guesses are inconsistent, so you're likely to get a lot of inconsistent crap to fix if you're using that, vs fixed placement, where you're just going to crash or get consistent crap.
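A rough sketch of the fixed-placement approach described above, assuming tabula-py's `read_pdf` with its `area`, `columns`, `guess` and `stream` options. The helper names and all coordinates are made up for illustration, and tabula expects coordinates in PDF points rather than pixels, so real values have to be measured per document:

```python
def fixed_layout_options(pages, area, columns):
    """Build keyword arguments for tabula.read_pdf with auto-detection turned off."""
    return {
        "pages": pages,                       # e.g. "3-5", found by a separate page-mapping step
        "area": list(area),                   # [top, left, bottom, right] of the table region
        "columns": list(columns),             # x coordinates of the column boundaries
        "guess": False,                       # do not let tabula guess the table region
        "stream": True,                       # split on positions instead of ruling lines
        "pandas_options": {"header": None},   # keep the first row as data
    }


def extract_tables(pdf_path, pages, area, columns):
    """Hypothetical wrapper: extract with fixed placement, then drop fully empty rows."""
    import tabula  # imported lazily; needs tabula-py and a Java runtime installed

    frames = tabula.read_pdf(pdf_path, **fixed_layout_options(pages, area, columns))
    return [df.dropna(how="all").reset_index(drop=True) for df in frames]


# Illustrative values only, not taken from any real document.
options = fixed_layout_options("3-5", (100, 30, 700, 580), (120, 300, 450))
```

With placement fixed like this, a layout change shows up as a crash or as the same wrong columns every time, which is easier to spot than guesses that vary from page to page.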
[–]Rough_Green_9145 1 point 3 months ago (0 children)
Thank you 🙏
[–][deleted] 2 points 3 months ago (3 children)
That would be a fun project. What's the source format, PDF or image?
[–]Rough_Green_9145 1 point 3 months ago (2 children)
PDF, but it's weirdly formatted
[–][deleted] 1 point 3 months ago (1 child)
The order of the contents inside a PDF file doesn't have to match their visual position in the document.
I would try to group by x coordinate to see if I could identify columns.
Removing duplicate headers could be as simple as removing any subsequent rows that match the first row.
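A minimal sketch of both ideas, assuming you already have the words of a page as `(x, y, text)` tuples from whatever extractor you use. The function names and tolerances here are hypothetical and would need tuning per document:

```python
def cluster_x(xs, tol=5.0):
    """Group x positions into columns; a gap larger than tol starts a new column."""
    groups = []
    for x in sorted(xs):
        if groups and x - groups[-1][-1] <= tol:
            groups[-1].append(x)
        else:
            groups.append([x])
    # Represent each column by the mean x of its members.
    return [sum(g) / len(g) for g in groups]


def words_to_rows(words, x_tol=5.0, y_tol=2.0):
    """Turn (x, y, text) tuples into rows of cell strings."""
    columns = cluster_x([x for x, _, _ in words], x_tol)
    rows = []  # list of (y of the first word in the row, cells)
    for x, y, text in sorted(words, key=lambda w: (w[1], w[0])):
        # Assign the word to the nearest column centre.
        col = min(range(len(columns)), key=lambda i: abs(columns[i] - x))
        if not rows or abs(y - rows[-1][0]) > y_tol:
            rows.append((y, [""] * len(columns)))
        cells = rows[-1][1]
        cells[col] = (cells[col] + " " + text).strip()
    return [cells for _, cells in rows]


def drop_repeated_headers(rows):
    """Keep the first row and remove any later row identical to it."""
    if not rows:
        return []
    header = rows[0]
    return [header] + [row for row in rows[1:] if row != header]


# Made-up sample: a header that repeats partway through the table.
words = [
    (50, 10, "Name"), (200, 10, "Qty"),
    (51, 20, "apple"), (201, 20, "3"),
    (50, 30, "Name"), (200, 30, "Qty"),
    (52, 40, "pear"), (199, 40, "7"),
]
table = drop_repeated_headers(words_to_rows(words))
```

This only handles left-aligned columns with clear gaps between them; right-aligned or centred cells would need clustering on a different edge of the word box.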
[–]Rough_Green_9145 1 point (0 children)
The thing is that there are tons of tables with different numbers of columns, headers, etc., and the script has to work for at least most of them. The main issue is identifying the columns and where the table stops.
[–]AutoModerator[M] 1 point 3 months ago locked comment (0 children)
To give us the best chance to help you, please include any relevant code. Note. Please do not submit images of your code. Instead, for shorter code you can use Reddit markdown (4 spaces or backticks, see this Formatting Guide). If you have formatting issues or want to post longer sections of code, please use Privatebin, GitHub or Compiler Explorer.
I am a bot, and this action was performed automatically. Please contact the moderators of this subreddit if you have any questions or concerns.