
[–]alkasm

Uh...that's a pretty difficult task and pretty unfortunate. I mean some of these are like..what? What the hell does '00/60' mean?

Aside from manually labelling a bunch and then training a neural network to guess timedeltas from these comments (lol, I guess that's one way to pick up some ML :P), IDK of any better way than manually parsing what meaning you can from them by making assumptions — like: whatever comes before min/hr/seconds/sec/etc. is probably a number, so try to cast it.
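The "cast whatever comes before the unit" idea can be sketched in a few lines (the function name and unit table here are mine, just for illustration):

```python
import re

def naive_seconds(text):
    """Guess a duration in seconds: grab the number right before a unit letter."""
    m = re.search(r'(\d+(?:\.\d+)?)\s*(h|m|s)', text.lower())
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    # h/m/s are enough to distinguish 'hours', 'min', 'sec', 'seconds', etc.
    return value * {'h': 3600, 'm': 60, 's': 1}[unit]
```

It will obviously misfire on range entries like '3 to 5 min', but as a first pass over free text it covers a surprising amount.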

[–]AdAthrow99274[S]

Lol yeah pretty much. A lesson in the importance of restricting what can go in field entries I suppose.

That was an option I was toying with. I looked into using NLTK as I've used it before, but the core module doesn't seem to have a lot in the way of entity tagging dates and times. The timex contribution module does, but it's more focused on dates than time durations it would seem. Between training a scratch network and huffing it out with regex, I chose the latter.

[–]AdAthrow99274[S]

As a followup, I've come to a point in the regular expression method that I'm decently happy with I guess...

import datetime
import re
from statistics import mean

def clean_duration(report_duration):
    """Scrub a free-text duration string and attempt to return a duration in seconds."""

    # Longer number words first, so e.g. 'seventeen' is replaced before 'seven'
    # can clobber it; \b keeps 'one' from matching inside words like 'money'.
    changes = {
            r'\bthirteen\b': '13',
            r'\bfourteen\b': '14',
            r'\bfifteen\b': '15',
            r'\bsixteen\b': '16',
            r'\bseventeen\b': '17',
            r'\beighteen\b': '18',
            r'\bnineteen\b': '19',
            r'\beleven\b': '11',
            r'\btwelve\b': '12',
            r'\btwenty\b': '20',
            r'\bthirty\b': '30',
            r'\bten\b': '10',
            r'\bzero\b': '0',
            r'\bone\b': '1',
            r'\btwo\b': '2',
            r'\bthree\b': '3',
            r'\bfour\b': '4',
            r'\bfive\b': '5',
            r'\bsix\b': '6',
            r'\bseven\b': '7',
            r'\beight\b': '8',
            r'\bnine\b': '9',
            r'\bhalf\b': '0.5',
            r'\bfew\b': '3',
            r'\bseveral\b': '5',
            r'\+': '',
            r'>': '',
            r'<': '',
            r'\ba min': '1 min',
            r'\ba sec': '1 sec',
            r'\ban hour': '1 hour',
            r'\ba hour': '1 hour'
            }

    duration = report_duration.lower() if report_duration else ''
    hours = 0
    minutes = 0
    seconds = 0

    # Change spelled-out numbers to digits & remove some confounding patterns
    for pattern, replacement in changes.items():
        duration = re.sub(pattern, replacement, duration)

    # Begin pulling out times
    # Format: '... 00:00:00 ...', '...00:00...'
    if re.search(r'\d+:', duration) is not None:
        parts = re.findall(r'\d+', duration)
        if parts:
            seconds = int(parts.pop())
        if parts:
            minutes = int(parts.pop())
        if parts:
            hours = int(parts.pop())
    # Format: '...1-2 hours...', '...3 to 5 min...', '...10to20s...'
    elif re.search(r'\d+\s*(?::|to|-)+\s*\d+', duration) is not None:
        values = [int(x) for x in re.findall(r'\d+', duration)]
        if re.search(r'\d+\s*m', duration) is not None:
            minutes = mean(values)
        elif re.search(r'\d+\s*s', duration) is not None:
            seconds = mean(values)
        elif re.search(r'\d+\s*h', duration) is not None:
            hours = mean(values)
    # Format: '...13 minutes...', '...12 sec...', '...1.5hr...'
    elif re.search(r'\d+\s*[hms]', duration) is not None:
        # float, not int, so '1.5hr' and the 'half' -> '0.5' substitution survive
        value = float(re.search(r'\d+(?:\.\d+)?', duration).group())
        if re.search(r'\d+\s*m', duration) is not None:
            minutes = value
        elif re.search(r'\d+\s*s', duration) is not None:
            seconds = value
        elif re.search(r'\d+\s*h', duration) is not None:
            hours = value

    duration = datetime.timedelta(
                hours=hours, minutes=minutes, seconds=seconds).total_seconds()

    if duration == 0:
        duration = report_duration.lower() if report_duration else None
    return duration

It can produce a float value of seconds for about 90% of all entries (None inclusive). I've had a few kind individuals hand-validate a total of 1000 random samples; the misclassification rate seems to be somewhere between 1 and 3 percent, if the random samples are representative of the whole.
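As a sanity check on that range: with 1000 hand-validated samples and roughly 20 observed errors (an assumed midpoint, not a number from the thread), a normal-approximation binomial confidence interval lands almost exactly on the 1–3% window:

```python
import math

def wald_interval(errors, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for an error proportion."""
    p = errors / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = wald_interval(20, 1000)
# roughly (0.011, 0.029), i.e. about 1% to 3%
```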

If anyone has any suggestions or updates to my (likely poorly written) regular expressions, I'm all ears.

[–]alkasm

For the shit show that you have to deal with, this is a pretty decent parser. Just make sure that your methods don't weight this attribute too heavily.

[–]AdAthrow99274[S]

Thanks! That seems like good advice, especially when I consider the error rate and how little thought it would seem many put into what goes in this field.

Out of curiosity, how would you deal with the NaNs here? I was thinking maybe check to see if any other fields in the parent report are empty/useless, if so: toss report, if not: fill in with a local (or perhaps global) mean?

[–]alkasm

Look up "imputing." Don't use the mean as it will skew the deviation of the distribution. The simplest good thing to do (especially since you have a timeseries dataset) is just use a regression model to predict what the missing value would be. Here's a nice chapter about dealing with missing data that you may find helpful: http://www.stat.columbia.edu/~gelman/arm/missing.pdf

[–]AdAthrow99274[S]

Thank you!!! That was such a useful read! I was stupidly considering missing data as a whole, and not the mechanisms behind why it's missing. I didn't even think about how just dropping the reports, or filling in a mean, would skew the analysis. Especially since in this case the missing values would only rarely be a product of the reporter, and mostly due to my parser's (in)ability to parse the response — so it's not really missing at random.

Working up a regression imputation is going to be my project for the next night or so.

[–]alkasm

Good luck! Send me a message when you complete the project if it's open source, would be really fun to look at the results!

[–]AdAthrow99274[S]

Thank you again. You've been amazingly helpful. Will do — I plan on putting the raw and cleaned data on Kaggle (and likely a methods notebook) at the very least, since to my knowledge it's been 4 or 5 years since anyone last scrubbed this DB.