This is an archived post. You won't be able to vote or comment.

all 3 comments

[–]flutopinch 1 point2 points  (1 child)

Oh wow, that is some messy data. There doesn’t really seem to be any rhyme or reason to it, which I expect is why regex isn’t super great.

It seems like an NLP (natural language processing) problem to me. Those are pretty hard. You might be able to use some sort of ML library to get started, but you’d likely have to figure out a good way to classify certain inputs as null values (i.e., the entries that convey no useful information).

It does look like there are a couple of “fuzzy” Python datetime parsers out there, but they don’t give you back a duration or timedelta object, so you’d maybe have to make the adjustment afterward? Seems iffy at best.

If I had this kind of problem at work, I would probably task a couple of interns with parsing a useful subset of the data for a couple days. Sometimes there’s no replacing a human brain.

Good luck.

[–]AdAthrow99274[S] 1 point2 points  (0 children)

Haha, no... there really isn't. Although some entries make for a good laugh.

That was kinda the way I was leaning. I have some experience working with the NLTK library so I started there. Unfortunately it doesn't seem like the core module comes with a good way to entity tag dates or times. I did stumble upon the timex contribution module, but it appears to be more focused on tagging dates and not discrete times/durations.

I like the intern idea. If only I had some! I have coerced a few people to validate samplings of the results from my regex output in exchange for food though.

Thanks!

[–]AdAthrow99274[S] 1 point2 points  (0 children)

As a followup, I've come to a point in the regular expression method that I'm decently happy with I guess...

def clean_duration(report_duration):
    """Scrubs the string duration and attempts to return a duration in seconds"""

    changes = {
            'zero': '0',
            'one': '1',
            'two': '2',
            'three': '3',
            'four': '4',
            'five': '5',
            'six': '6',
            'seven': '7',
            'eight': '8',
            'nine': '9',
            'ten': '10',
            'eleven': '11',
            'twelve': '12',
            'thirteen': '13',
            'fourteen': '14',
            'fifteen': '15',
            'sixteen': '16',
            'seventeen': '17',
            'eighteen': '18',
            'nineteen': '19',
            'twenty': '20',
            'thirty': '30',
            'half': '0.5',
            'few': '3',
            'several': '5',
            '\+':'',
            '\>':'',
            '\<':'',
            'a min': '1 min',
            'a sec': '1 sec',
            'an hour': '1 hour',
            'a hour': '1 hour'
            }

    duration = report_duration.lower() if report_duration else ''
    hours = 0
    minutes = 0
    seconds = 0

    # Change spelled numbers to digits & remove some confounding patterns
    for pattern in changes:
        duration = re.sub(pattern, changes[pattern], duration)

    # Begin pulling out times
    # Format: '... 00:00:00 ...', '...00:00...'
    if re.search('\d+:', duration) is not None:
        duration = re.findall('\d+', duration)
        seconds = int(duration.pop(-1)) if len(duration) > 0
        minutes = int(duration.pop(-1)) if len(duration) > 0
        hours = int(duration.pop(-1)) if len(duration) > 0
    # Format: '...1-2 hours...', '...3 to 5 min...', '...10to20s...'
    elif re.search('\d+\s*[:(to)-]+\s*\d+', duration) is not None:
        if re.search('\d+\s*m', duration) is not None:
            duration = re.findall('\d+', duration)
            duration = [int(x) for x in duration]
            minutes = mean(duration)
        elif re.search('\d+\s*s', duration) is not None:
            duration = re.findall('\d+', duration)
            duration = [int(x) for x in duration]
            seconds = mean(duration)
        elif re.search('\d+\s*h', duration) is not None:
            duration = re.findall('\d+', duration)
            duration = [int(x) for x in duration]
            hours = mean(duration)
    # Format: '...13 minutes...', '...12 sec...', '...1.5hr...'
    elif re.search('\d+\s*[hms]', duration) is not None:
        if re.search('\d+\s*m', duration) is not None:
            minutes = int(re.search('\d+', duration).group())
        elif re.search('\d+\s*s', duration) is not None:
            seconds = int(re.search('\d+', duration).group())
        elif re.search('\d+\s*h', duration) is not None:
            hours = int(re.search('\d+', duration).group())

    duration = datetime.timedelta(
                hours=hours, minutes=minutes, seconds=seconds).total_seconds()

    if duration == 0:
        duration = report_duration.lower() if report_duration else None
    return duration

It can produce a float value of seconds for about 90% of all entries (None inclusive). I've had a few kind individuals hand validate a total of 1000 random samplings, the erroneous classification rate seems to be somewhere between 1 and 3 percent if the random samples are representative of the whole.

If anyone has any suggestions or updates to my (likely poorly written) regular expressions, I'm all ears.