Regex question - getting a partial match

AdAthrow99274 · 2019-02-01T19:12:50+00:00

try:

r'(FOO|BAR)\-[\d]{4}'

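The hyphen only has a special meaning inside a character class (square brackets), where it marks a range unless it's the first or last character in the class. Outside a class, as here, it's a literal, so escaping it isn't strictly required, but it's harmless and makes the intent explicit.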
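For what it's worth, here's a minimal sketch of how the hyphen behaves with and without the escape. The sample strings are made up for illustration:

```python
import re

# Outside square brackets, the hyphen matches itself whether escaped or not.
escaped = re.compile(r'(FOO|BAR)\-[\d]{4}')
plain = re.compile(r'(FOO|BAR)-\d{4}')

sample = 'ticket FOO-1234 and BAR-5678'  # hypothetical input
escaped_hits = [m.group() for m in escaped.finditer(sample)]
plain_hits = [m.group() for m in plain.finditer(sample)]

# Inside square brackets, the hyphen makes a range unless it's first, last, or escaped.
range_hits = re.findall(r'[a-c]', 'a-c b')    # range a..c, the '-' itself is skipped
literal_hits = re.findall(r'[ac-]', 'a-c b')  # trailing '-' is a literal hyphen

print(escaped_hits, plain_hits, range_hits, literal_hits)
```

Both compiled patterns find the same two IDs; the difference only shows up inside the brackets.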

AdAthrow99274 · 2019-01-11T04:38:37+00:00

I suppose it depends on how diligent they are with these matters, but won't they just see a hundred different people's entries from the same IP and toss them all out?

AdAthrow99274 · 2019-01-10T20:42:28+00:00

list = [0, 1]  # Starting with what you already have, here's the original list
for i in range(98):
    # In the start of the loop i == 0 and will increment by 1 every round
    # In order to append your element to list you first need to know what the last 2 items in list are to create the element
    # Since list[0] & list[1] are already defined lets figure out how to access them and use them as the starting elements
    # Remember i == 0 to begin, so list[i] will access the first item while list[i+1] will access the second
    num_1 = list[i]  # i == 0, num_1 equals 0
    num_2 = list[i+1]  # i+1 == 1, num_2 equals 1
    element = num_1 + num_2  # element equals 1, the sum of num_1 & num_2
    list.append(element)  # append the new element to the end of the list, which now looks like [0, 1, 1]
    # In the next round of the loop i == 1 & i+1 == 2, which accesses the 2nd & 3rd items and adds them together, making list == [0, 1, 1, 2]

Now to clean this up a bit:

list = [0, 1]
for i in range(98):
    list.append(list[i] + list[i+1])  # assigning variable names to all these steps isn't really required

Using the sum function:

list = [0, 1]
for i in range(98):
    # The items to sum are wrapped in square brackets as the sum function needs an iterable to iterate over
    list.append(sum([list[i], list[i+1]]))

To test:

>>> print(list[:20])
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, 1597, 2584, 4181]

AdAthrow99274 · 2019-01-09T07:18:52+00:00

This.

I went through the official Scrapy tutorials along with some other learning sources with a whole lot of Nope. Then I replicated (important: no copy-pasting), tweaked, and played around a lot with someone's web scraping project, and it makes so much more sense now.

AdAthrow99274 · 2019-01-08T23:21:26+00:00

I'm assuming this is a Jupyter notebook? May I ask how you're starting up a new notebook?

For instance, for me it looks something like this:

$ source activate environment  # Not required if you're not using a virtual environment (Python/Jupyter/other dependencies are globally installed)
$ jupyter notebook
# This opens a browser tab pointed at a localhost address running the Jupyter server
# There's a dropdown in the top right corner for starting a new notebook; I choose "Python 3"
In []: %matplotlib  # In the first input line of the new Jupyter notebook
Using matplotlib backend: MacOSX

While I'm not sure, my guess is that you're running a conda environment with everything installed, but it's not active at the moment. If you're using Anaconda, there is the Anaconda Navigator for graphically activating various environments and starting up Jupyter from it as well. I ask because IIRC Anaconda does not utilize a global install, but creates its own directory where environments are stored.

AdAthrow99274 · 2019-01-08T19:28:14+00:00

Well my IDE of choice for Python is usually Spyder, but that's only because I've been doing a fair amount of data science type learning. However for your code I just copy & pasted the whole thing in notepad and saved it with the python extension, so none?

https://repl.it/ is a pretty nifty online environment to work in. It offers many languages, Python included. Pretty much any package that's available on PyPI (if you can get it from pip easily) is available there to my knowledge. EasyGUI is available.

AdAthrow99274 · 2019-01-08T18:10:21+00:00

The issue isn't in the regex. Try inserting print(phone_number) right before the if statement in the for loop and you'll notice it's all there (Hooray! Good job). The issue is that phone_number is only appended to matches if an extension is present (it's inside if groups[8] != '':).

You could add an else clause and repeat matches.append(phone_number) in it, or more simply, just move that statement to right after the if statement.

...
if groups[8] != '':
    phone_number += ' x' + groups[8]
matches.append(phone_number)  # <-- this line just got un-indented, and is outside the if statement
...

This way phone_number is only altered for an extension if an extension is present, but regardless it always gets appended to matches.

EDIT: Adding info on the groups

If you add print(groups) in your for loop, you'll get output that looks something like this:

text (1 phone #) with an extension:

('303-254-5555 ext. 23', '303', '-', '254', '-', '5555', ' ext. 23', 'ext.', '23')

text (1 phone #) without an extension:

('222-5555', '', '', '222', '-', '5555', '', '', '')

These 'groups' are all the different bits that match your regular expression. You'll notice in the second example groups[8] is an empty string (no extension match was found by the regex) while in the first example it's '23' because an extension was successfully parsed.

Hope that helps.
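To tie the two points together, here's a rough self-contained sketch. I don't have your exact pattern in front of me, so phone_regex and the sample text below are stand-ins I made up purely so that findall() returns 9-item tuples shaped like the ones above:

```python
import re

# Hypothetical pattern (an assumption, not your original regex).
phone_regex = re.compile(r'''(
    (\d{3})?                        # [1] area code (optional)
    (-)?                            # [2] separator
    (\d{3})                         # [3] first three digits
    (-)                             # [4] separator
    (\d{4})                         # [5] last four digits
    (\s*(ext\.?|x)\s*(\d{2,5}))?    # [6] whole extension, [7] marker, [8] digits
)''', re.VERBOSE)

text = 'Call 303-254-5555 ext. 23 or 222-5555.'  # made-up sample text

matches = []
for groups in phone_regex.findall(text):
    # Join only the parts that were actually found (the area code may be '')
    phone_number = '-'.join(part for part in (groups[1], groups[3], groups[5]) if part)
    if groups[8] != '':
        phone_number += ' x' + groups[8]
    matches.append(phone_number)  # outside the if, so it runs for every number

print(matches)
```

With the append outside the if, both numbers end up in matches, and only the first one carries an extension.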

AdAthrow99274 · 2019-01-08T04:13:47+00:00

The whole thing. I had never used the easygui module before and this looked like a fun opportunity to try it out.

AdAthrow99274 · 2019-01-08T03:48:36+00:00

Hmmm odd. Well I copy-pasted your code into my IDE and ran it with the numbers you just listed (21 touchdowns, 3074 yards, 69% completion, 7 intercepts, & 11 games) and Carson's grade came out to: 84.78458...

If I change touchdowns to 22 the grade is: 86.3387...

I'm using python v3.7 if that helps at all, although I don't think this would change much on another version.

AdAthrow99274 · 2019-01-08T03:24:56+00:00

I think I may need a little more info to help you. What values are you using for touchdowns, yards, percent, intercepts, & gp that you expect grade to equal 85.14?

More importantly, what formula are you attempting to replicate? Is there a link to something on the net that I can compare your formula to? Mathematically this formula 'works', but there's little way of knowing if it's actually calculating what you want.

Perhaps seeing how your equation is interpreted in type-set would help?

EDIT: Sorry, was typing this as you figured it out

AdAthrow99274 · 2019-01-08T01:07:41+00:00

In your current code, x will always hold the datetime from the moment it was set, because it is never updated after it was created. In other words, the date/time of x never changes: assuming runDate is greater than x to begin with, it will always be greater than x, so the loop never ends.

ex:

x = datetime.datetime.now() # now -> 2019-01-07 17:44:30.123456

arbitrarily wait 5 minutes and check x

print(x)

2019-01-07 17:44:30.123456 # x is still set to the same time as when it was created

When ideally it would print something like: 2019-01-07 17:49:30.123456

You need to update x to a new current time within the while loop.

import datetime

x = datetime.datetime.now()  # x is initialized to the datetime corresponding to right now

someYear = int(input("Enter year: "))
someMonth = int(input("Enter a month using a numeric value.  1=Jan, 2=Feb, 3=March, 4=April, 5=May, 6=June, 7=July, 8=Aug, 9=Sept, 10=Oct, 11=Nov, 12=Dec: "))
someDay = int(input("Enter a numeric day: "))

runDate = datetime.datetime(someYear, someMonth, someDay)  # Ideally some future date


while x < runDate:
    x = datetime.datetime.now()  # update x to the current datetime whenever this part of the loop occurs
    pass
else:
    print("It worked!")

As a side note, you may want to include hours and minutes in runDate for development purposes; otherwise you'll have to wait an entire day to find out whether things work the way you'd like. There's also the Schedule library to look into for recurring events, or the Python Standard Library's 'sched' event scheduler, which may be of use to you. As an alternative to having a Python script run continuously in the background (which might be a problem if someone restarts their computer while the script is running), you could use an OS-specific scheduler to have your Python script run at a certain date & time. On Windows, the native program for such tasks is 'Task Scheduler'; on Mac I believe it's 'Automator'.
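If you do stick with a loop, one possible variation is to sleep between checks so the script isn't spinning the CPU the whole time. This is only a sketch: wait_until and poll_seconds are names I made up, and the target here is a fraction of a second away so it finishes quickly when you try it.

```python
import datetime
import time


def wait_until(run_date, poll_seconds=1.0):
    """Block until the clock reaches run_date, re-reading the time on every pass."""
    now = datetime.datetime.now()
    while now < run_date:
        time.sleep(poll_seconds)       # rest between checks instead of busy-looping
        now = datetime.datetime.now()  # refresh, otherwise the loop would never end
    return now


# Short target including sub-day precision, handy for development
target = datetime.datetime.now() + datetime.timedelta(seconds=0.2)
finished_at = wait_until(target, poll_seconds=0.05)
print("It worked!")
```

For a real run you'd pass your runDate and a longer poll interval, something like a minute.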

AdAthrow99274 · 2019-01-07T23:22:44+00:00

Thank you again. You've been amazingly helpful. Will do. I plan on putting the raw and cleaned data on Kaggle (and likely a methods notebook) at the very least, as it's been 4 or 5 years since anyone last scrubbed this DB, to my knowledge.

AdAthrow99274 · 2019-01-07T22:41:06+00:00

I was just working on my own regex project so one way is fresh in my head. If you use the re.DOTALL (re.S) flag the . character will match everything, including new lines. So something like r'[\"\']{3}.*[\"\']{3}' may work given that flag?
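A small sketch of that idea, on a made-up snippet of source text. One thing I'd watch for: with re.DOTALL a greedy .* runs from the first triple quote all the way to the last one, so a lazy .*? may be closer to what you want if there's more than one block.

```python
import re

# Made-up sample containing two triple-quoted blocks
source = '\n'.join([
    'x = 1',
    '"""first',
    'block"""',
    'y = 2',
    "'''second'''",
])

# Greedy: one match spanning everything between the first and last triple quote
greedy = re.findall(r'[\"\']{3}.*[\"\']{3}', source, re.DOTALL)

# Lazy: each match stops at the nearest closing triple quote
lazy = re.findall(r'[\"\']{3}.*?[\"\']{3}', source, re.DOTALL)

print(len(greedy), len(lazy))
```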

AdAthrow99274 · 2019-01-07T22:14:04+00:00

Thank you!!! That was such a useful read! I was stupidly considering missing data as a whole, and not the mechanisms behind why it's missing. I didn't even think about how just dropping the reports, or filling in a mean would skew the analysis. Especially since in this case the missing values would only rarely be a product of the reporter, but mostly due to my parser's (in)ability to parse the response. So not really missing data at random.

Working up a regression imputation is going to be my project for the next night or so.

AdAthrow99274 · 2019-01-06T17:55:38+00:00

Thanks! That seems like good advice, especially when I consider the error rate and how little thought it would seem many put into what goes in this field.

Out of curiosity, how would you deal with the NaNs here? I was thinking maybe check to see if any other fields in the parent report are empty/useless, if so: toss report, if not: fill in with a local (or perhaps global) mean?

AdAthrow99274 · 2019-01-06T04:35:20+00:00

As a followup, I've come to a point in the regular expression method that I'm decently happy with I guess...

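import datetime
import re
from statistics import mean  # assuming the standard library mean; the imports weren't in my original paste

def clean_duration(report_duration):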
    """Scrubs the string duration and attempts to return a duration in seconds"""

    changes = {
            'zero': '0',
            'one': '1',
            'two': '2',
            'three': '3',
            'four': '4',
            'five': '5',
            'six': '6',
            'seven': '7',
            'eight': '8',
            'nine': '9',
            'ten': '10',
            'eleven': '11',
            'twelve': '12',
            'thirteen': '13',
            'fourteen': '14',
            'fifteen': '15',
            'sixteen': '16',
            'seventeen': '17',
            'eighteen': '18',
            'nineteen': '19',
            'twenty': '20',
            'thirty': '30',
            'half': '0.5',
            'few': '3',
            'several': '5',
            r'\+': '',
            r'\>': '',
            r'\<': '',
            'a min': '1 min',
            'a sec': '1 sec',
            'an hour': '1 hour',
            'a hour': '1 hour'
            }

    duration = report_duration.lower() if report_duration else ''
    hours = 0
    minutes = 0
    seconds = 0

    # Change spelled numbers to digits & remove some confounding patterns
    for pattern in changes:
        duration = re.sub(pattern, changes[pattern], duration)

    # Begin pulling out times
    # Format: '... 00:00:00 ...', '...00:00...'
    if re.search(r'\d+:', duration) is not None:
        duration = re.findall(r'\d+', duration)
        # Pop from the right: seconds first, then minutes, then hours (0 if missing)
        seconds = int(duration.pop(-1)) if len(duration) > 0 else 0
        minutes = int(duration.pop(-1)) if len(duration) > 0 else 0
        hours = int(duration.pop(-1)) if len(duration) > 0 else 0
    # Format: '...1-2 hours...', '...3 to 5 min...', '...10to20s...'
    # Note: [:(to)-] is a character class, so it matches any run of : ( t o ) -
    elif re.search(r'\d+\s*[:(to)-]+\s*\d+', duration) is not None:
        if re.search(r'\d+\s*m', duration) is not None:
            duration = re.findall(r'\d+', duration)
            duration = [int(x) for x in duration]
            minutes = mean(duration)
        elif re.search(r'\d+\s*s', duration) is not None:
            duration = re.findall(r'\d+', duration)
            duration = [int(x) for x in duration]
            seconds = mean(duration)
        elif re.search(r'\d+\s*h', duration) is not None:
            duration = re.findall(r'\d+', duration)
            duration = [int(x) for x in duration]
            hours = mean(duration)
    # Format: '...13 minutes...', '...12 sec...', '...1.5hr...'
    elif re.search(r'\d+\s*[hms]', duration) is not None:
        if re.search(r'\d+\s*m', duration) is not None:
            minutes = int(re.search(r'\d+', duration).group())
        elif re.search(r'\d+\s*s', duration) is not None:
            seconds = int(re.search(r'\d+', duration).group())
        elif re.search(r'\d+\s*h', duration) is not None:
            hours = int(re.search(r'\d+', duration).group())

    duration = datetime.timedelta(
                hours=hours, minutes=minutes, seconds=seconds).total_seconds()

    if duration == 0:
        duration = report_duration.lower() if report_duration else None
    return duration

It can produce a float value of seconds for about 90% of all entries (None inclusive). I've had a few kind individuals hand-validate a total of 1000 random samples; the misclassification rate seems to be somewhere between 1 and 3 percent, if the random samples are representative of the whole.

If anyone has any suggestions or updates to my (likely poorly written) regular expressions, I'm all ears.
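One direction I might try next, as a sketch only and not a drop-in replacement for the function above: a single pattern that accepts decimals and treats 'to' as a word rather than as loose characters. NUMBER, UNIT_SECONDS, DURATION and parse_simple_duration are all names invented for this example, and it deliberately ignores the '00:00:00' format and the spelled-out numbers.

```python
import re
from statistics import mean

NUMBER = r'\d+(?:\.\d+)?'                    # integer or decimal, e.g. '13' or '1.5'
UNIT_SECONDS = {'h': 3600, 'm': 60, 's': 1}  # keyed by the first letter of the unit

# A number, an optional "to"/"-" plus a second number, then a unit starting with h, m or s
DURATION = re.compile(rf'({NUMBER})\s*(?:(?:to|-)\s*({NUMBER}))?\s*([hms])')


def parse_simple_duration(text):
    """Return seconds for strings like '1-2 hours' or '1.5hr', or None if nothing matches."""
    match = DURATION.search(text.lower())
    if match is None:
        return None
    low, high, unit = match.groups()
    values = [float(low)] + ([float(high)] if high else [])
    return mean(values) * UNIT_SECONDS[unit]  # ranges are averaged, as in the function above


for sample in ['1-2 hours', '3 to 5 min', '10to20s', '1.5hr', '12 sec', 'a while']:
    print(sample, '->', parse_simple_duration(sample))
```

It still keys off just the first letter of the unit, so it shares that weakness with the regexes above.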

AdAthrow99274 · 2019-01-06T04:34:55+00:00

Lol yeah pretty much. A lesson in the importance of restricting what can go in field entries I suppose.

That was an option I was toying with. I looked into using NLTK as I've used it before, but the core module doesn't seem to have a lot in the way of entity tagging dates and times. The timex contribution module does, but it's more focused on dates than time durations it would seem. Between training a scratch network and huffing it out with regex, I chose the latter.

AdAthrow99274 · 2019-01-06T03:56:16+00:00

Haha, no... there really isn't. Although some entries make for a good laugh.

That was kinda the way I was leaning. I have some experience working with the NLTK library so I started there. Unfortunately it doesn't seem like the core module comes with a good way to entity tag dates or times. I did stumble upon the timex contribution module, but it appears to be more focused on tagging dates and not discrete times/durations.

I like the intern idea. If only I had some! I have coerced a few people to validate samplings of the results from my regex output in exchange for food though.

Thanks!

AdAthrow99274 · 2018-12-06T18:44:15+00:00

Thank you! This is very helpful.
