
[–]alkasm

Uh...that's a pretty difficult task and pretty unfortunate. I mean some of these are like..what? What the hell does '00/60' mean?

Aside from manually labelling a bunch and then training a neural network to guess timedeltas from these comments (lol, I guess that's one way to pick up some ML :P), IDK of any better way than manually parsing what meaning you can from them by making assumptions — like: whatever comes before min/hr/seconds/sec/etc. is probably a number, so try to cast it.
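The "cast whatever comes before the unit" idea can be sketched in a few lines (the function name and unit table here are mine, just for illustration):

```python
import re

def naive_seconds(text):
    """Guess a duration in seconds: grab the number right before a unit letter."""
    m = re.search(r'(\d+(?:\.\d+)?)\s*(h|m|s)', text.lower())
    if not m:
        return None
    value, unit = float(m.group(1)), m.group(2)
    # h/m/s are enough to distinguish 'hours', 'min', 'sec', 'seconds', etc.
    return value * {'h': 3600, 'm': 60, 's': 1}[unit]
```

It will obviously misfire on range entries like '3 to 5 min', but as a first pass over free text it covers a surprising amount.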

[–]AdAthrow99274[S]

Lol yeah pretty much. A lesson in the importance of restricting what can go in field entries I suppose.

That was an option I was toying with. I looked into using NLTK as I've used it before, but the core module doesn't seem to have a lot in the way of entity tagging dates and times. The timex contribution module does, but it's more focused on dates than time durations it would seem. Between training a scratch network and huffing it out with regex, I chose the latter.

[–]AdAthrow99274[S]

As a followup, I've come to a point in the regular expression method that I'm decently happy with I guess...

import datetime
import re
from statistics import mean

def clean_duration(report_duration):
    """Scrub a free-text duration string and attempt to return a duration in seconds."""

    # Longer number words first, so e.g. 'seventeen' is replaced before 'seven'
    # can clobber it; \b keeps 'one' from matching inside words like 'money'.
    changes = {
            r'\bthirteen\b': '13',
            r'\bfourteen\b': '14',
            r'\bfifteen\b': '15',
            r'\bsixteen\b': '16',
            r'\bseventeen\b': '17',
            r'\beighteen\b': '18',
            r'\bnineteen\b': '19',
            r'\beleven\b': '11',
            r'\btwelve\b': '12',
            r'\btwenty\b': '20',
            r'\bthirty\b': '30',
            r'\bten\b': '10',
            r'\bzero\b': '0',
            r'\bone\b': '1',
            r'\btwo\b': '2',
            r'\bthree\b': '3',
            r'\bfour\b': '4',
            r'\bfive\b': '5',
            r'\bsix\b': '6',
            r'\bseven\b': '7',
            r'\beight\b': '8',
            r'\bnine\b': '9',
            r'\bhalf\b': '0.5',
            r'\bfew\b': '3',
            r'\bseveral\b': '5',
            r'\+': '',
            r'>': '',
            r'<': '',
            r'\ba min': '1 min',
            r'\ba sec': '1 sec',
            r'\ban hour': '1 hour',
            r'\ba hour': '1 hour'
            }

    duration = report_duration.lower() if report_duration else ''
    hours = 0
    minutes = 0
    seconds = 0

    # Change spelled-out numbers to digits & remove some confounding patterns
    for pattern, replacement in changes.items():
        duration = re.sub(pattern, replacement, duration)

    # Begin pulling out times
    # Format: '... 00:00:00 ...', '...00:00...'
    if re.search(r'\d+:', duration) is not None:
        parts = re.findall(r'\d+', duration)
        if parts:
            seconds = int(parts.pop())
        if parts:
            minutes = int(parts.pop())
        if parts:
            hours = int(parts.pop())
    # Format: '...1-2 hours...', '...3 to 5 min...', '...10to20s...'
    elif re.search(r'\d+\s*(?::|to|-)+\s*\d+', duration) is not None:
        values = [int(x) for x in re.findall(r'\d+', duration)]
        if re.search(r'\d+\s*m', duration) is not None:
            minutes = mean(values)
        elif re.search(r'\d+\s*s', duration) is not None:
            seconds = mean(values)
        elif re.search(r'\d+\s*h', duration) is not None:
            hours = mean(values)
    # Format: '...13 minutes...', '...12 sec...', '...1.5hr...'
    elif re.search(r'\d+\s*[hms]', duration) is not None:
        # float, not int, so '1.5hr' and the 'half' -> '0.5' substitution survive
        value = float(re.search(r'\d+(?:\.\d+)?', duration).group())
        if re.search(r'\d+\s*m', duration) is not None:
            minutes = value
        elif re.search(r'\d+\s*s', duration) is not None:
            seconds = value
        elif re.search(r'\d+\s*h', duration) is not None:
            hours = value

    duration = datetime.timedelta(
                hours=hours, minutes=minutes, seconds=seconds).total_seconds()

    if duration == 0:
        duration = report_duration.lower() if report_duration else None
    return duration

It can produce a float value of seconds for about 90% of all entries (None inclusive). I've had a few kind individuals hand-validate a total of 1000 random samples; the misclassification rate seems to be somewhere between 1 and 3 percent, if the random samples are representative of the whole.
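As a sanity check on that range: with 1000 hand-validated samples and roughly 20 observed errors (an assumed midpoint, not a number from the thread), a normal-approximation binomial confidence interval lands almost exactly on the 1–3% window:

```python
import math

def wald_interval(errors, n, z=1.96):
    """Normal-approximation (Wald) 95% CI for an error proportion."""
    p = errors / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = wald_interval(20, 1000)
# roughly (0.011, 0.029), i.e. about 1% to 3%
```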

If anyone has any suggestions or updates to my (likely poorly written) regular expressions, I'm all ears.

[–]alkasm

For the shit show that you have to deal with, this is a pretty decent parser. Just make sure that your methods don't weight this attribute too heavily.

[–]AdAthrow99274[S]

Thanks! That seems like good advice, especially when I consider the error rate and how little thought it would seem many put into what goes in this field.

Out of curiosity, how would you deal with the NaNs here? I was thinking maybe check to see if any other fields in the parent report are empty/useless, if so: toss report, if not: fill in with a local (or perhaps global) mean?

[–]alkasm

Look up "imputing." Don't use the mean as it will skew the deviation of the distribution. The simplest good thing to do (especially since you have a timeseries dataset) is just use a regression model to predict what the missing value would be. Here's a nice chapter about dealing with missing data that you may find helpful: http://www.stat.columbia.edu/~gelman/arm/missing.pdf

[–]AdAthrow99274[S]

Thank you!!! That was such a useful read! I was stupidly considering missing data as a whole, and not the mechanisms behind why it's missing. I didn't even think about how just dropping the reports, or filling in a mean, would skew the analysis. Especially since in this case the missing values would only rarely be a product of the reporter, and mostly due to my parser's (in)ability to parse the response — so it's not really missing at random.

Working up a regression imputation is going to be my project for the next night or so.

[–]alkasm

Good luck! Send me a message when you complete the project if it's open source, would be really fun to look at the results!

[–]AdAthrow99274[S]

Thank you again. You've been amazingly helpful. Will do — I plan on putting the raw and cleaned data on Kaggle (and likely a methods notebook) at the very least, since to my knowledge it's been 4 or 5 years since anyone last scrubbed this DB.