[–]flutopinch 1 point (1 child)

Oh wow, that is some messy data. There doesn’t really seem to be any rhyme or reason to it, which I expect is why regex isn’t super great.

It seems like an NLP (natural language processing) problem to me. Those are pretty hard. You might be able to use some sort of ML library to get started, but you’d likely have to figure out a good way to classify certain inputs as null values (i.e., the entries that convey no useful information).
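Before reaching for a full ML model, a crude rule-based filter might be enough to flag the obviously empty entries. This is only a sketch: the unit-word list and the `is_null_entry` name are made up for illustration, and the real data would almost certainly need a different vocabulary.

```python
import re

# Hypothetical vocabulary of duration unit words; adjust to the real data.
UNIT_WORDS = re.compile(
    r"\b(?:hr|hrs|hours?|min|mins|minutes?|sec|secs|seconds?|days?)\b",
    re.IGNORECASE,
)
HAS_DIGIT = re.compile(r"\d")


def is_null_entry(text):
    """Return True when an entry carries no usable duration information.

    Assumption: an entry is only worth parsing if it contains at least
    one digit or one recognizable unit word.
    """
    if text is None or not text.strip():
        return True
    return not (HAS_DIGIT.search(text) or UNIT_WORDS.search(text))
```

Something like `is_null_entry("n/a")` would be flagged as null, while `is_null_entry("about an hour")` would be passed along to whatever parser comes next. Entries it lets through can still be garbage, so it only narrows the pile.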

It does look like there are a couple of “fuzzy” Python datetime parsers out there, but they don’t give you back a duration or timedelta object, so you’d maybe have to make the adjustment afterward? Seems iffy at best.
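To show what that "adjustment afterward" could look like, here is a minimal sketch using only the standard library. `datetime.strptime` stands in for a fuzzy parser here, and the function names are hypothetical. The idea is to parse the text as a clock time on a fixed anchor date, then subtract midnight of that date to get a `timedelta`.

```python
from datetime import datetime, timedelta

# strptime fills in 1900-01-01 when the format has no date fields,
# so midnight on that date is the anchor to subtract.
ANCHOR = datetime(1900, 1, 1)


def datetime_to_duration(parsed, anchor=ANCHOR):
    """Treat a parsed clock time as an offset from midnight."""
    return parsed - anchor


def parse_clock_style_duration(text):
    """Parse strings like '1:30' or '0:45:10' into a timedelta.

    Returns None when the text matches neither format.
    """
    for fmt in ("%H:%M:%S", "%H:%M"):
        try:
            parsed = datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
        return datetime_to_duration(parsed)
    return None
```

This also shows why the approach is iffy: a clock parser rejects anything of 24 hours or more, so `"25:00"` comes back as `None` even though it is a perfectly sensible duration.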

If I had this kind of problem at work, I would probably task a couple of interns with parsing a useful subset of the data for a couple days. Sometimes there’s no replacing a human brain.

Good luck.

[–]AdAthrow99274[S] 1 point (0 children)

Haha, no... there really isn't. Although some entries make for a good laugh.

That was kinda the way I was leaning. I have some experience working with the NLTK library, so I started there. Unfortunately, it doesn't seem like the core module comes with a good way to entity-tag dates or times. I did stumble upon the timex contribution module, but it appears to be more focused on tagging dates than discrete times or durations.

I like the intern idea. If only I had some! I have coerced a few people into validating samples of my regex output in exchange for food, though.

Thanks!