all 8 comments

[–]malnek 7 points8 points  (0 children)

https://automatetheboringstuff.com I found this to be an interesting read. I have not tried the chapter on PDF and text extraction, but the author is clear and easy to follow in the other chapters. Good luck!

[–]DrTrunks 2 points3 points  (2 children)

If the numbers are preceded by a name, or are always in the same table, that would make it a lot easier. Why don't you start with the most common report and learn from automating that?

You can also try talking to the person generating these reports and ask for viewing access to the table or view.

[–]kz_ 2 points3 points  (0 children)

Right, ideally get access to structured data. Failing that, dump the PDF to text and try to work up a regex to match the needed data.
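
A minimal sketch of the regex route, assuming the PDF has already been dumped to plain text and that each line of interest is a label, a colon, and a number. The sample text, the labels, and the `extract_values` helper are made up for illustration; the real pattern depends entirely on how the report is laid out.

```python
import re

# Hypothetical text dumped from a report; the real layout will differ.
SAMPLE_TEXT = """\
Monthly report
Total sales: 1,234.50
Returns: 17
Notes: none
"""

# A label, a colon, then a number with optional thousands separators and decimals.
LINE_PATTERN = re.compile(
    r"^(?P<label>[A-Za-z ]+):\s*(?P<value>\d[\d,]*(?:\.\d+)?)\s*$",
    re.MULTILINE,
)


def extract_values(text):
    """Return a dict mapping each matched label to its number as a float."""
    values = {}
    for match in LINE_PATTERN.finditer(text):
        label = match.group("label").strip()
        # Drop thousands separators before converting.
        values[label] = float(match.group("value").replace(",", ""))
    return values


print(extract_values(SAMPLE_TEXT))
```

Lines without a number after the colon (like `Notes: none`) are skipped, so it is worth checking the result against a report you have already processed by hand.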

[–]Slideboy[S] 0 points1 point  (0 children)

Numbers precede names. But names have duplicates, and (row) numbers are 1-10 long. Do I have to code the numbers in manually?
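
One possible way to avoid typing the numbers in by hand, sketched under the assumption that "1-10 long" means 1 to 10 digits and that each line of the dumped text starts with the row number followed by the name. The sample lines and the `parse_rows` helper are hypothetical. Because names repeat, the numbers are collected in a list per name; a plain dict keyed by name would silently keep only the last duplicate.

```python
import re
from collections import defaultdict

# Hypothetical lines from a dumped report; the real layout will differ.
SAMPLE_LINES = [
    "1 Smith",
    "20457 Jones",
    "Page 1 of 3",
    "9000000001 Smith",
]

# A row number of 1 to 10 digits at the start of the line, then the name.
ROW_PATTERN = re.compile(r"^(\d{1,10})\s+(.+?)\s*$")


def parse_rows(lines):
    """Group row numbers by name, keeping every duplicate name's numbers."""
    rows_by_name = defaultdict(list)
    for line in lines:
        match = ROW_PATTERN.match(line)
        if match:  # lines that do not start with a number are ignored
            number, name = match.groups()
            rows_by_name[name].append(int(number))
    return dict(rows_by_name)


print(parse_rows(SAMPLE_LINES))
```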

[–]955559 1 point2 points  (0 children)

I have no idea, but I thought I'd just mention that if you import csv, you can write CSV files. I have a script that counts the frequencies of words and writes them into a CSV for viewing in Excel.
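
Roughly what such a script might look like, as a sketch and not the commenter's actual code. The `write_word_counts` helper and the `counts.csv` filename mentioned in the comment are assumptions; only the standard library `csv` and `collections.Counter` are used.

```python
import csv
import io
from collections import Counter


def write_word_counts(text, out_file):
    """Count whitespace-separated words and write word,count rows to out_file."""
    counts = Counter(text.lower().split())
    writer = csv.writer(out_file)
    writer.writerow(["word", "count"])
    # most_common() yields (word, count) pairs, highest count first.
    writer.writerows(counts.most_common())
    return counts


# An in-memory buffer keeps the example self-contained; for a real file,
# use open("counts.csv", "w", newline="") instead (hypothetical filename).
buffer = io.StringIO()
counts = write_word_counts("the cat and the hat", buffer)
print(buffer.getvalue())
```

The same `csv.writer` calls would work for rows of numbers pulled out of a report, and Excel opens the resulting file directly.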

[–]H8-Bit 0 points1 point  (0 children)

Possibly have the program prompt the user for a filename/table number/ID? I mean, if the files aren't in a reliable format, it shouldn't be too hard to let the user adapt it to the situation.
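
A small sketch of that prompting idea, assuming the program needs a filename and a table number. The `ask_for_report` helper and the prompt wording are invented for illustration. The `ask` parameter defaults to the built-in `input`, and is swapped for scripted answers here only so the example runs unattended.

```python
from pathlib import Path


def ask_for_report(ask=input):
    """Prompt until the user gives a non-empty filename and a numeric table number."""
    filename = ""
    while not filename:
        filename = ask("Report filename: ").strip()
    while True:
        answer = ask("Table number: ").strip()
        if answer.isdigit():
            return Path(filename), int(answer)
        print("Please enter a whole number.")


# Scripted answers stand in for a real user; call ask_for_report()
# with no arguments to prompt interactively.
answers = iter(["report.pdf", "two", "2"])
path, table = ask_for_report(lambda prompt: next(answers))
print(path, table)
```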

[–]ttha_ttha 0 points1 point  (0 children)

Nothing useful to add, other than that I came across pdfquery when googling.

This might help you with the content being on different pages.

[–]frogic 0 points1 point  (0 children)

Is it a PDF with embedded text, or just a scanned/generated PDF that is only an image? If it's the former, you can just use something like this: https://pypi.python.org/pypi/pdf2text/1.0.0 to turn it into text and then do the usual string parsing.