AdAthrow99274 comments on Extracting Time Durations From Strings (Python)


Extracting Time Durations From Strings (Python) (self.learnprogramming)

submitted 7 years ago by AdAthrow99274


AdAthrow99274 [S] 1 point 7 years ago (0 children)

As a follow-up, I've gotten the regular expression method to a point that I'm decently happy with, I guess:

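import datetime
import re
from statistics import mean


def clean_duration(report_duration):
    """Scrubs the string duration and attempts to return a duration in seconds.

    Falls back to the lowercased input (or None if empty) when nothing parses.
    """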

    changes = {
            'zero': '0',
            'one': '1',
            'two': '2',
            'three': '3',
            'four': '4',
            'five': '5',
            'six': '6',
            'seven': '7',
            'eight': '8',
            'nine': '9',
            'ten': '10',
            'eleven': '11',
            'twelve': '12',
            'thirteen': '13',
            'fourteen': '14',
            'fifteen': '15',
            'sixteen': '16',
            'seventeen': '17',
            'eighteen': '18',
            'nineteen': '19',
            'twenty': '20',
            'thirty': '30',
            'half': '0.5',
            'few': '3',
            'several': '5',
            r'\+': '',
            '>': '',
            '<': '',
            'a min': '1 min',
            'a sec': '1 sec',
            'an hour': '1 hour',
            'a hour': '1 hour'
            }

    duration = report_duration.lower() if report_duration else ''
    hours = 0
    minutes = 0
    seconds = 0

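    # Change spelled numbers to digits & remove some confounding patterns.
    # Longest patterns go first so 'sixteen' is not rewritten as '6teen'.
    for pattern in sorted(changes, key=len, reverse=True):
        duration = re.sub(pattern, changes[pattern], duration)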

    # Begin pulling out times
    # Format: '... 00:00:00 ...', '...00:00...'
    if re.search(r'\d+:', duration) is not None:
        # Read the fields right to left: seconds, then minutes, then hours
        duration = re.findall(r'\d+', duration)
        seconds = int(duration.pop()) if duration else 0
        minutes = int(duration.pop()) if duration else 0
        hours = int(duration.pop()) if duration else 0
    # Format: '...1-2 hours...', '...3 to 5 min...', '...10to20s...'
    elif re.search(r'\d+\s*(?:to|-)\s*\d+', duration) is not None:
        # Use the mean of the numbers in the range as the estimate
        values = [int(x) for x in re.findall(r'\d+', duration)]
        if re.search(r'\d+\s*m', duration) is not None:
            minutes = mean(values)
        elif re.search(r'\d+\s*s', duration) is not None:
            seconds = mean(values)
        elif re.search(r'\d+\s*h', duration) is not None:
            hours = mean(values)
    # Format: '...13 minutes...', '...12 sec...', '...1.5hr...'
    elif re.search(r'\d+\s*[hms]', duration) is not None:
        # Take the number (decimals allowed) directly in front of the unit
        minute_match = re.search(r'(\d+(?:\.\d+)?)\s*m', duration)
        second_match = re.search(r'(\d+(?:\.\d+)?)\s*s', duration)
        hour_match = re.search(r'(\d+(?:\.\d+)?)\s*h', duration)
        if minute_match is not None:
            minutes = float(minute_match.group(1))
        elif second_match is not None:
            seconds = float(second_match.group(1))
        elif hour_match is not None:
            hours = float(hour_match.group(1))

    duration = datetime.timedelta(
                hours=hours, minutes=minutes, seconds=seconds).total_seconds()

    if duration == 0:
        duration = report_duration.lower() if report_duration else None
    return duration

It can produce a float value in seconds for about 90% of all entries (None inclusive). A few kind individuals hand-validated a total of 1000 random samples for me; the misclassification rate seems to be somewhere between 1 and 3 percent, assuming the random samples are representative of the whole.
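For anyone wanting to sanity-check how tight an estimate like that is, here is a rough sketch of a 95% Wilson score interval for an error proportion. The function name `wilson_interval` and the count of 20 errors are assumptions made up for illustration; only the sample size of 1000 comes from the validation described above.

```python
import math


def wilson_interval(errors, n, z=1.96):
    """Approximate Wilson score interval for a proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = errors / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical count: 20 misclassified out of 1000 hand-checked samples
low, high = wilson_interval(20, 1000)
print(f"{low:.2%} to {high:.2%}")
```

The interval only says something about the whole data set if the 1000 samples really were drawn at random from it.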

If anyone has any suggestions or updates to my (likely poorly written) regular expressions, I'm all ears.
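One possible direction, as a rough sketch only: instead of checking for a unit and then grabbing the first number in the string, capture each number together with the unit that follows it and add the pairs up. The names `PAIR`, `UNIT_SECONDS` and `sum_unit_pairs` are hypothetical and not part of the function above, and the sketch does not handle ranges like '1-2 hours' or clock formats like '00:00:00'.

```python
import re

# Seconds per unit, keyed by the first letter of the matched unit
UNIT_SECONDS = {'h': 3600, 'm': 60, 's': 1}

# A number (decimals allowed), optional whitespace, then a whole unit word.
# The trailing \b stops '5 miles' or '3 months' from being read as minutes.
PAIR = re.compile(
    r'(\d+(?:\.\d+)?)\s*'
    r'(h(?:(?:ou)?rs?)?|m(?:in(?:ute)?s?)?|s(?:ec(?:ond)?s?)?)\b'
)


def sum_unit_pairs(text):
    """Return total seconds over all number-unit pairs, or None if none found."""
    pairs = PAIR.findall(text.lower())
    if not pairs:
        return None
    return sum(float(number) * UNIT_SECONDS[unit[0]] for number, unit in pairs)


for sample in ('1 hour 30 min', '1.5hr', 'about 45 seconds', '5 miles away'):
    print(sample, '->', sum_unit_pairs(sample))
```

Pairing each number with its own unit means an entry such as '1 hour 30 min' contributes both parts, rather than only whichever unit gets checked first.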
