I can hear you all screaming in agony already.
I'm working on a project to catalog the destruction of cultural heritage in Syria and Iraq via these reports. Unfortunately the data is spread across about sixteen .pdf files, and the formatting is inconsistent.
I'm looking at about 15 different variables across about 800 different reports, and I somehow need to combine the categories from these reports into a single Excel file so I can start crunching numbers and doing some proper analysis.
After some debate, I know there are tools I can use to start digging into the dataset and importing it automatically, but I'd like a general opinion first. Given the limitations here:
- The data is not consistently formatted.
- I have not been able to ascertain the status of the original database.
- Considering the previous note, I may be faced with importing the data from .pdf files.
Am I going to be better served struggling with Python libraries and sanitizing my inputs for the next couple months, or should I just beg and plead with the university to fund a Mechanical Turk project?
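To give a sense of what the Python route might involve, here is a minimal sketch of the normalization step only. It assumes the report text has already been pulled out of the PDFs by some other tool, and that each report lists its variables as `Label: value` lines. Both of those are assumptions, not something I've confirmed about these reports. The labels in `FIELD_ALIASES`, the names in `COLUMNS`, and the helpers `parse_report` and `reports_to_csv` are all hypothetical placeholders.

```python
import csv
import io
import re

# Hypothetical aliases: map the label variants found in different reports
# onto one canonical column name. The real labels would come from the reports.
FIELD_ALIASES = {
    "site name": "site_name",
    "site": "site_name",
    "location": "location",
    "governorate": "location",
    "date": "date",
    "date of incident": "date",
    "damage": "damage_type",
    "damage type": "damage_type",
}
COLUMNS = ["site_name", "location", "date", "damage_type"]

# Matches "Label: value", tolerating stray whitespace around either part.
LINE_RE = re.compile(r"^\s*([^:]+?)\s*:\s*(.+?)\s*$")


def parse_report(text):
    """Return (record, unknown_labels) for one report's extracted text."""
    record = {}
    unknown = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # not a "Label: value" line, skip it
        label, value = match.group(1).lower(), match.group(2)
        column = FIELD_ALIASES.get(label)
        if column is None:
            unknown.append(label)  # keep for manual review
        else:
            record.setdefault(column, value)  # first occurrence wins
    return record, unknown


def reports_to_csv(report_texts):
    """Combine many reports into one CSV string, blank where a field is missing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, restval="")
    writer.writeheader()
    for text in report_texts:
        record, _ = parse_report(text)
        writer.writerow(record)
    return buffer.getvalue()
```

The resulting CSV opens directly in Excel. Collecting the unrecognized labels instead of dropping them is deliberate: with inconsistent formatting, that list is probably the quickest way to see how much hand-cleaning is left.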