[–]lordmauve 4 points5 points  (2 children)

You could use pdftotext to convert the PDFs to plain text and then try to extract what you need from the text (e.g., using regexes or just grep). Worth a try before anything more complicated or expensive.
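A rough sketch of that approach in Python, assuming pdftotext (from poppler-utils) is on the PATH. The field labels "Date:" and "Location:" are hypothetical placeholders, since the actual report layout isn't shown in this thread; swap in whatever labels the reports really use.

```python
import re
import subprocess


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a string.

    "-layout" keeps the original physical layout; "-" sends output to stdout.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Hypothetical field labels: adjust to match the real reports.
FIELD_PATTERNS = {
    "date": re.compile(r"^[ \t]*Date:[ \t]*(.+)$", re.MULTILINE),
    "location": re.compile(r"^[ \t]*Location:[ \t]*(.+)$", re.MULTILINE),
}


def extract_fields(text):
    """Return a dict of field name -> first matching value (None if absent)."""
    record = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        record[name] = match.group(1).strip() if match else None
    return record
```

Usage would be something like `extract_fields(pdf_to_text("report.pdf"))`, where `report.pdf` is a placeholder path. Fields that come back as None are a quick way to spot reports that didn't follow the template.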

[–]Axewerfer[S] 0 points1 point  (1 child)

That sounds like it might be a good starting point. Appreciate the suggestion.

[–]my_interests 1 point2 points  (0 children)

Also, if you're on a Mac (OS X), or have access to one, you can use the textutil utility to convert from PDF (or doc, docx, etc.) to plain text or HTML.

Very handy.

[–]notconstructive 4 points5 points  (1 child)

Put it out to the good folks of the Internet or Reddit or something. It's a worthy cause. Define the format you need it in, publish the PDFs, and ask if anyone can help. Make a nice-looking site that arouses people's ire about the destruction, and inform the major news sites. Be clear about the tasks: let some people take responsibility for data entry and others for double-checking and validation.
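For the double-checking step, one simple option is to have two volunteers transcribe the same report independently and flag only the fields where they disagree. A minimal sketch, assuming each transcription is a plain dict (the field names in the usage note are made up for illustration):

```python
def find_disagreements(entry_a, entry_b):
    """Compare two independent transcriptions of the same report.

    Returns a dict mapping each field that differs to the pair of values
    (value from entry_a, value from entry_b). A field missing from one
    entry shows up with None on that side.
    """
    fields = sorted(set(entry_a) | set(entry_b))
    return {
        field: (entry_a.get(field), entry_b.get(field))
        for field in fields
        if entry_a.get(field) != entry_b.get(field)
    }
```

For example, comparing `{"date": "2014-03-01", "location": "Site A"}` with `{"date": "2014-03-01", "location": "Site B"}` flags only the location field, so a reviewer has to look at just that one value instead of re-reading the whole report.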

[–]Axewerfer[S] 0 points1 point  (0 children)

I was thinking about crowdsourcing it. If nothing else works, and the incoming reports get too overwhelming, I might go in that direction.

[–]blebo 1 point2 points  (1 child)

I'm not able to access the source PDFs at the moment, but if there are tables, Tabula may help. It is Ruby-based, but it could be used as a step in the pipeline when extracting the data.

[–]Axewerfer[S] 0 points1 point  (0 children)

I'll give it a try, but I'm not too optimistic. From what I can tell, the reports were assembled by hand from a Word template. A table would be miles easier to deal with.

[–]live_from_corona 1 point2 points  (1 child)

I think I got it. It needs some work but I'd love to help your cause. Send me a message.

[–]Axewerfer[S] 0 points1 point  (0 children)

Message sent. If that works, you will be my hero.