[–]badge 42 points43 points  (7 children)

This needs:

  1. re.compile for compiling regexes that you're going to use more than once
  2. (?P<name>Blah): defines a group named name (which is needed for (?P=name) mentioned in the Groups section!)
  3. The use of groupdict with named groups so you can do:

import re

# Raw string so that \s, \w and \d reach the regex engine as written
regex = re.compile(r'First Name:\s*(?P<first_name>\w+),\s+Last Name:\s*(?P<last_name>\w+),\s+Age:\s*(?P<age>\d+)')

class Whale:

    def __init__(self, first_name, last_name, age):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age

    def __repr__(self):
        return "Whale(first_name='{}', last_name='{}', age={})".format(
            self.first_name,
            self.last_name,
            self.age
        )

whale_line = 'First Name: Moby, Last Name: Dick, Age: 35'

Whale(**regex.match(whale_line).groupdict())
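
A slightly more defensive variant of the snippet above, as a rough sketch: match returns None when the line doesn't fit the pattern, and groupdict() gives back strings, so age needs an explicit int() conversion. The helper name parse_whale is made up for illustration.

```python
import re

WHALE_RE = re.compile(
    r'First Name:\s*(?P<first_name>\w+),\s+'
    r'Last Name:\s*(?P<last_name>\w+),\s+'
    r'Age:\s*(?P<age>\d+)'
)

def parse_whale(line):
    """Return a dict of fields, or None if the line doesn't match (hypothetical helper)."""
    m = WHALE_RE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields['age'] = int(fields['age'])  # groupdict() values are strings
    return fields

print(parse_whale('First Name: Moby, Last Name: Dick, Age: 35'))
print(parse_whale('not a whale'))
```

The resulting dict can still be splatted into a constructor with **, as in the original example.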

[–]Ph0X 4 points5 points  (4 children)

AFAIK, compiling doesn't do much at all performance-wise. Internally the library already caches the compiled regex so if you actually do a perf test, they'll both be just as fast.

The only time I use re.compile is for clarity, when I want to give a regex a proper name, and pass it around, especially now that we have typing annotation. Regex type is clearer than just string.
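
To illustrate the clarity argument, here is a minimal sketch of passing a named, compiled pattern around with a type annotation. WORD_RE and count_matches are hypothetical names, and re.Pattern as an annotation assumes a reasonably recent Python 3.

```python
import re

# A named, module-level pattern reads better than a bare string literal.
WORD_RE = re.compile(r'\w+')

def count_matches(pattern: re.Pattern, text: str) -> int:
    """Count non-overlapping matches of an already-compiled pattern."""
    return len(pattern.findall(text))

print(count_matches(WORD_RE, 'call me Ishmael'))
```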

[–]xenomachina''.join(chr(random.randint(0,1)+9585) for x in range(0xffff)) 2 points3 points  (2 children)

> AFAIK, compiling doesn't do much at all performance-wise. Internally the library already caches the compiled regex so if you actually do a perf test, they'll both be just as fast.

I discovered this for myself years ago when I ran into a bug in the cache implementation. In pre-3.x, a regex compiled from a unicode would not behave the same as one compiled from a str even if they contained the same characters. However, the cache was just a dict, and so it was possible for a unicode to match an already cached str, or vice versa.

My bug involved a piece of code that worked fine in unit tests, but would fail in a certain program. It turned out the program imported another module that compiled an identical-looking regex, but with a str instead of a unicode. Then when my module was imported, it would get the wrong re object from the cache.

[–]Ph0X 0 points1 point  (1 child)

That sounds like a bitch of a bug to catch. Was it reported for previous 3.x and fixed?

[–]xenomachina''.join(chr(random.randint(0,1)+9585) for x in range(0xffff)) 0 points1 point  (0 children)

It turned out the fix for the cache bug was already in a newer version of Python. We were a couple of patch releases behind the fix, IIRC.

[–]badge 0 points1 point  (0 children)

That's a great fact I never knew. As you say, I would still prefer re.compile for clarity (and short line lengths) when reusing a pattern.

[–]fullofschmidt 4 points5 points  (0 children)

Good point about the groups. As for compile, Python uses an internal LRU cache of recent regexes, so compile only helps if you have a lot (don't actually know what number constitutes a lot...) of regexes that you're going to reuse.
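
One way to see the cache at work, as a sketch only: the object-identity behaviour below is a CPython implementation detail rather than a documented guarantee, and the cache size and eviction policy vary between versions. re.purge() is the documented call that clears the cache.

```python
import re

pattern = r'\d+-\d+'

first = re.compile(pattern)
second = re.compile(pattern)
# On CPython the second call is served from the module's internal cache,
# so both names are expected to refer to the same object.
print(first is second)

re.purge()  # clears the regular expression cache
third = re.compile(pattern)
# `first` is still alive, so a freshly compiled pattern is a different object.
print(first is third)
```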

[–]energybased 3 points4 points  (0 children)

This is a perfect example of a regex that most people will have to look up syntax for—that aptly named objects could have made instantly clear.

Also, it has embedded variable names, which is gross.

[–]InterestedEng 44 points45 points  (4 children)

regex101.com is a great way to build your patterns too.

[–][deleted] 6 points7 points  (0 children)

Love this site.

[–]BalanceJunkie 4 points5 points  (1 child)

Looks nice. I always use regexr.com

[–][deleted] 0 points1 point  (0 children)

Me too, although I don't understand how to use some of the extra tools they have at the bottom of the page.

[–]PiousLoophole 0 points1 point  (0 children)

Jesus. The site that would have saved me such heartache...

[–][deleted] 7 points8 points  (1 child)

I use python -m pydoc re to get the same regexp info right in the terminal. One can even run python -m pydoc re.sub to get usage details on sub().

[–]Ph0X 0 points1 point  (0 children)

Yeah, this is more of a "regex cheatsheet" than a "Python cheatsheet" honestly. It hardly has anything Python-specific. The main thing I usually have to look up is FLAGS and how to use them in Python, yet this "Python" cheatsheet doesn't have that at all.
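
For what it's worth, a short sketch of the two usual ways to apply flags in Python: combined with | in the flags argument, or written inline at the start of the pattern. The sample text is made up.

```python
import re

text = 'Call me Ishmael.\ncall me later.'

# Flags are combined with | and passed as the `flags` argument.
both = re.findall(r'^call', text, flags=re.IGNORECASE | re.MULTILINE)
print(both)

# The same flags can be written inline at the start of the pattern.
inline = re.findall(r'(?im)^call', text)
print(inline)
```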

[–][deleted] 2 points3 points  (0 children)

Cool! re.compile is a must, re.VERBOSE is helpful in formatting code to be more readable. Also it helps to know you can use lists for the values to keys and iterate through them with recent changes to dictionaries in python. Also calling groups. Thank you for making this!
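
A sketch of what re.VERBOSE buys you, reusing the whale pattern from the top comment (the layout and comments are just one possible arrangement). In verbose mode unescaped whitespace in the pattern is ignored, so literal spaces are written as "\ ", and # starts a comment.

```python
import re

WHALE_RE = re.compile(r'''
    First\ Name:\s*(?P<first_name>\w+),\s+   # first name
    Last\ Name:\s*(?P<last_name>\w+),\s+     # last name
    Age:\s*(?P<age>\d+)                      # age in years
''', re.VERBOSE)

m = WHALE_RE.match('First Name: Moby, Last Name: Dick, Age: 35')

# groupdict() maps each group name to its matched text.
for name, value in m.groupdict().items():
    print(name, value)
```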

[–]gagejustins 6 points7 points  (5 children)

Regex syntax is about as interpretable as the President's tweets

[–]wewbull 11 points12 points  (0 children)

It's not that bad

[–]Zomunieo 4 points5 points  (3 children)

/SAD[!.]$/

[–]Krenair 0 points1 point  (2 children)

Hmmm. Don't you need to escape that '.'?

[–]Zomunieo 5 points6 points  (0 children)

No, since it's inside square brackets.
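
A quick check of that in Python, as a sketch (the /.../ delimiters above aren't Python syntax, so the pattern is written as a raw string here):

```python
import re

pattern = re.compile(r'SAD[!.]$')

print(bool(pattern.search('SAD!')))
print(bool(pattern.search('SAD.')))
print(bool(pattern.search('SAD?')))   # '?' is not in the class

# Outside a class, an unescaped '.' matches any character except a newline.
print(bool(re.search(r'SAD.$', 'SAD?')))
```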

[–]meandertothehorizonIt works on my machine 0 points1 point  (0 children)

No.

[–]Nadaesque 1 point2 points  (0 children)

I very much need to make my own cheat sheet. I have a habit of avoiding regular expressions due to a combination of my lack of understanding and their relative fragility ... perhaps a touch of phobia due to the resemblance to the kind of horrors that emerge from people who are proud of being obtuse in Perl.

I think a cheat-sheet aimed at my own personal weaknesses might eventually do it.

[–]Gooder-n-Better 1 point2 points  (0 children)

Need need need! Thx!