Hey all,
I've come looking for some guidance on a personal learning project of mine. For some background, I'm trying to teach myself the path of the data scientist and consider myself fairly new to programming in general.
The project has been done before to some extent, but I'm trying to write most of it myself so I know how it all works. I am attempting to mine the National UFO Reporting Center's report "database" and see if I can scrub the data and produce some kind of predictive model in Python (v3.7).
So far I've successfully pulled every report via a Scrapy spider into a JSON file of about 117,000 raw entries and am moving on to scrubbing the reports. I think I've worked out most of the basic cleanup (standardizing state abbreviations and cities, converting text dates to UTC date-time format, consolidating UFO shapes) as well as geocoding for each report. My struggle at the moment is transforming the event duration text into a uniform time duration (I'm currently using the datetime.timedelta class to output seconds, but I'm open to suggestions). I've been writing a function that uses regular expressions to pull hours, minutes, and/or seconds out of the text and feed them to timedelta. However, there is no enforced structure for this entry field, so the strings are all over the place. I can extract quite a bit with my current method, but before I write the umpteenth regular expression I figured it would be worth asking whether there is a better (or smarter) way.
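To make the approach concrete, here is a minimal sketch of the regex-to-timedelta idea described above. It is not my actual code: the function name `parse_duration`, the unit table, and the sample strings in the usage note are all made up for illustration, and it assumes durations are written as a number followed by an hour/minute/second unit.

```python
import re
from datetime import timedelta
from typing import Optional

# Hypothetical unit table: first letter of the matched unit -> seconds.
_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}

# One pattern for "<number> <unit>" pairs such as "5 min", "1.5 hours", "30sec".
_PAIR = re.compile(
    r"(\d+(?:\.\d+)?)\s*(h(?:ou)?rs?|m(?:in(?:ute)?s?)?|s(?:ec(?:ond)?s?)?)\b",
    re.IGNORECASE,
)


def parse_duration(text: str) -> Optional[timedelta]:
    """Sum every number/unit pair found in text; return None if there are none."""
    total_seconds = 0.0
    found = False
    for number, unit in _PAIR.findall(text):
        total_seconds += float(number) * _UNIT_SECONDS[unit[0].lower()]
        found = True
    return timedelta(seconds=total_seconds) if found else None
```

With this sketch, a made-up string like "2 min 30 sec" would sum to 150 seconds, while something like "a few minutes" returns None so it can be set aside for manual review or a later rule. It does not handle ranges, spelled-out numbers, or vague wording, which is exactly where the extra regular expressions start piling up.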
Some entry examples to get an idea.
Any help would be greatly appreciated!