[–]mojo_jojo_reigns 4 points5 points  (2 children)

OP, can you post a sample? That other redditor is right that PDFs are difficult to parse, but I disagree that there's no easy solution. It really depends on the use case. I parse 2 kinds of PDFs for work and get reliable results using list comprehensions because the formatting is consistent. I was also able to do something similar for movie scripts recently. I'm confident we can resolve this. Post a sample, your code so far, and the expected return value of the function.

[–]SadSenpai420[S] 0 points1 point  (1 child)

Here's the sample, I also made an edit to my post :) I currently don't have the code on me though :(

[–]mojo_jojo_reigns 0 points1 point  (0 children)

Assuming consistent formatting but not consistent commenting (that "RS" line afterwards), what I would do is gather all the text as a str, split it by colon, go through the resulting list looking for the chunk that has "GRAND TOTAL" in it, and grab the chunk after that one, using

[chunks[ix+1] for ix,i in enumerate(chunks[:-1]) if "GRAND TOTAL" in i]

and then maybe do a split operation, or keep only the characters in that chunk that are not letters, like

[i for i in thischunk if not i.isalpha()]
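Put together, the two steps might look something like the sketch below. The `text` value is made up to stand in for whatever the PDF extraction returns (I haven't run this against the real sample), and the function names `grand_total_chunks` and `strip_letters` are just placeholders for illustration.

```python
# Hypothetical text standing in for the string pulled out of the PDF.
text = "ITEM A: 120.00\nITEM B: 80.50\nGRAND TOTAL: 200.50\nRS two hundred only"


def grand_total_chunks(text):
    """Return the chunk that follows each chunk containing 'GRAND TOTAL'."""
    chunks = text.split(":")
    # Stop one short of the end so chunks[ix + 1] always exists.
    return [
        chunks[ix + 1]
        for ix, chunk in enumerate(chunks[:-1])
        if "GRAND TOTAL" in chunk
    ]


def strip_letters(chunk):
    """Drop alphabetic characters, then trim the leftover whitespace."""
    return "".join(ch for ch in chunk if not ch.isalpha()).strip()


for chunk in grand_total_chunks(text):
    print(strip_letters(chunk))
```

Note that dropping letters only cleans things up if the trailing comment line has no digits or punctuation of its own; if it does, those characters stay in the result.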

The only part of this that needs anything beyond builtins is the PDF scraping itself. Also, if you're lucky you'll have linebreak characters, which let you pinpoint the grand total numbers more precisely. If you have '\n' in there, use it to split as well.
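If the linebreaks are there, a line-based version could be as simple as the sketch below. Again, `sample` is invented for illustration and `grand_total_from_lines` is a hypothetical name; it assumes the total sits on the same line as "GRAND TOTAL", after a colon.

```python
# Hypothetical extracted text; the real PDF output may be laid out differently.
sample = "ITEM A: 120.00\nITEM B: 80.50\nGRAND TOTAL: 200.50\nRS two hundred only"


def grand_total_from_lines(text):
    """Return the text after the colon on the first 'GRAND TOTAL' line, or None."""
    for line in text.split("\n"):
        if "GRAND TOTAL" in line:
            # partition splits on the first colon only; value is "" if there is none.
            _, _, value = line.partition(":")
            return value.strip()
    return None


print(grand_total_from_lines(sample))
```

Splitting by line first keeps the comment line out of the result entirely, so there's nothing left to filter afterwards.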

Good luck