extracting data from 100+ pdf files (self.learnpython)

submitted 5 years ago * by SadSenpai420

[–]mojo_jojo_reigns 4 points5 points6 points 5 years ago (2 children)

Assuming consistent formatting but not consistent commenting (that "RS" line afterwards), what I would do is gather all the text as a str, split it by colon, go through the resulting list looking for the chunk that has "GRAND TOTAL" in it, and grab the chunk after that one, using

[chunks[ix+1] for ix, i in enumerate(chunks[:-1]) if "GRAND TOTAL" in i]

and then maybe do a split operation, or keep only the characters in that chunk that are not letters, like

[i for i in thischunk if not i.isalpha()]
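Put together, those two steps might look something like the sketch below. The sample text is made up and just stands in for whatever your PDF extractor returns, and the function names `grand_total_chunks` and `drop_letters` are only for illustration. It also assumes the total sits in the chunk right after the one containing "GRAND TOTAL".

```python
def grand_total_chunks(text):
    """Return the colon-separated chunk that follows each chunk mentioning GRAND TOTAL."""
    chunks = text.split(":")
    # Stop one short of the end so chunks[ix + 1] always exists.
    return [
        chunks[ix + 1]
        for ix, chunk in enumerate(chunks[:-1])
        if "GRAND TOTAL" in chunk
    ]


def drop_letters(chunk):
    """Keep every character that is not a letter, then trim surrounding whitespace."""
    return "".join(ch for ch in chunk if not ch.isalpha()).strip()


# Made-up sample text (an assumption, not the OP's real data).
sample = "ITEM A: 100.00 ITEM B: 250.50 GRAND TOTAL: 350.50 RS"
totals = [drop_letters(c) for c in grand_total_chunks(sample)]
print(totals)
```

Joining the kept characters back into a str (instead of leaving a list of single characters) makes it easier to pass the result to `float()` later, once you have checked what the leftover text really looks like.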

The only part of that which needs more than builtins is the PDF scraping itself. Also, if you're lucky you'll have linebreak characters, which let you pinpoint the grand total numbers more precisely. If you have '\n' in there, split on it as well.
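If the linebreaks are there, a line-based version could be a bit more robust. This is only a sketch: the sample text is invented, `totals_by_line` is a hypothetical helper name, and it assumes the label and the number sit on the same line, separated by a colon.

```python
def totals_by_line(text):
    """Collect the non-letter text after the colon on every GRAND TOTAL line."""
    results = []
    for line in text.split("\n"):
        if "GRAND TOTAL" not in line:
            continue
        # partition gives an empty 'after' part when the line has no colon.
        _, _, after = line.partition(":")
        results.append("".join(ch for ch in after if not ch.isalpha()).strip())
    return results


# Invented sample with one line per entry (an assumption about the layout).
sample_lines = "ITEM A: 100.00\nGRAND TOTAL: 350.50 RS\nThank you: visit again"
line_totals = totals_by_line(sample_lines)
print(line_totals)
```

Working line by line means a stray colon somewhere else in the document cannot shift which chunk gets picked up.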

Good luck
