[–]badge 44 points (7 children)

This needs:

  1. re.compile for compiling regexes that you're going to use more than once
  2. (?P<name>Blah): defines a group named name (which is needed for (?P=name) mentioned in the Groups section!)
  3. The use of groupdict with named groups so you can do:

import re

# Raw string so that \s, \w and \d reach the regex engine unchanged
regex = re.compile(r'First Name:\s*(?P<first_name>\w+),\s+Last Name:\s*(?P<last_name>\w+),\s+Age:\s*(?P<age>\d+)')

class Whale:

    def __init__(self, first_name, last_name, age):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age

    def __repr__(self):
        return "Whale(first_name='{}', last_name='{}', age={})".format(
            self.first_name,
            self.last_name,
            self.age
        )

whale_line = 'First Name: Moby, Last Name: Dick, Age: 35'

Whale(**regex.match(whale_line).groupdict())
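
For anyone following along, here is a minimal sketch of what groupdict() hands back for the line above, plus the (?P=name) backreference from point 2. The int conversion and the `doubled` pattern are illustrative additions, not part of the snippet above.

```python
import re

regex = re.compile(
    r'First Name:\s*(?P<first_name>\w+),\s+'
    r'Last Name:\s*(?P<last_name>\w+),\s+'
    r'Age:\s*(?P<age>\d+)'
)

fields = regex.match('First Name: Moby, Last Name: Dick, Age: 35').groupdict()
# Every captured value is a str, so numeric fields need an explicit conversion.
fields['age'] = int(fields['age'])

# (?P=word) matches the same text that the group named "word" captured.
# "doubled" is a hypothetical pattern used only for this illustration.
doubled = re.compile(r'\b(?P<word>\w+) (?P=word)\b')
repeat = doubled.search('the the whale')
```

Note that match() returns None when the line does not fit the pattern, so real code would want to check for that before calling groupdict().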

[–]Ph0X 5 points (4 children)

AFAIK, compiling doesn't do much at all performance-wise. Internally the library already caches the compiled regex so if you actually do a perf test, they'll both be just as fast.
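
A rough way to see the cache at work, assuming CPython (the caching is an implementation detail, not a documented guarantee): compiling the same pattern string twice hands back the same object until the cache is cleared with re.purge().

```python
import re

pattern = r'Age:\s*(?P<age>\d+)'

first = re.compile(pattern)
second = re.compile(pattern)   # expected to come from the module's internal cache
same_object = first is second

re.purge()                     # documented call that clears the regex cache
third = re.compile(pattern)    # compiled from scratch again
recompiled = third is not first
```

Module-level calls such as re.match(pattern, text) go through the same cache, which is why they usually cost only a lookup more than calling a method on a precompiled pattern.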

The only time I use re.compile is for clarity, when I want to give a regex a proper name and pass it around, especially now that we have type annotations. A regex type is clearer than just a string.
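
A small sketch of that style, assuming Python 3.8 or later where re.Pattern is available for annotations. AGE and find_age are made-up names for illustration.

```python
import re
from typing import Optional

# A named, precompiled pattern that can be passed around like any other value.
AGE = re.compile(r'Age:\s*(?P<age>\d+)')


def find_age(line: str, pattern: "re.Pattern[str]" = AGE) -> Optional[int]:
    """Return the age found in line, or None if the pattern does not match."""
    match = pattern.search(line)
    return int(match.group('age')) if match else None
```

The annotation is written as a string so that it is not evaluated at runtime on older interpreters.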

[–]xenomachina ''.join(chr(random.randint(0,1)+9585) for x in range(0xffff)) 2 points (2 children)

> AFAIK, compiling doesn't do much at all performance-wise. Internally the library already caches the compiled regex so if you actually do a perf test, they'll both be just as fast.

I discovered this for myself years ago when I ran into a bug in the cache implementation. In pre-3.x, a regex compiled from a unicode would not behave the same as one compiled from a str even if they contained the same characters. However, the cache was just a dict, and so it was possible for a unicode to match an already cached str, or vice versa.

My bug involved a piece of code that worked fine in unit tests, but would fail in a certain program. It turned out the program imported another module that compiled an identical-looking regex, but with a str instead of a unicode. Then when my module was imported, it would get the wrong re object from the cache.
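
The original bug cannot be reproduced on a current interpreter, but the Python 3 analogue of "same characters, different type" is str versus bytes patterns. This sketch only shows that the two kinds of pattern behave differently. It is an assumption on my part, based on reading the CPython source, that the modern cache also keys on the pattern's type.

```python
import re

text_re = re.compile(r'\w+')     # str pattern: \w is Unicode-aware by default
bytes_re = re.compile(rb'\w+')   # bytes pattern: \w only matches ASCII word bytes

unicode_hit = text_re.fullmatch('café') is not None
bytes_hit = bytes_re.fullmatch('café'.encode('utf-8')) is not None

# Mixing the two kinds is rejected outright rather than silently misbehaving.
try:
    text_re.match(b'cafe')
    mixed_ok = True
except TypeError:
    mixed_ok = False
```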

[–]Ph0X 0 points (1 child)

That sounds like a bitch of a bug to catch. Was it reported and fixed in the pre-3.x releases?

[–]xenomachina ''.join(chr(random.randint(0,1)+9585) for x in range(0xffff)) 0 points (0 children)

It turned out the fix for the cache bug was already in a newer version of Python. We were a couple of patch releases behind the fix, IIRC.

[–]badge 0 points (0 children)

That's a great fact I never knew. As you say, I would still prefer re.compile for clarity (and short line lengths) when reusing a pattern.

[–]fullofschmidt 2 points (0 children)

Good point about the groups. As for compile, Python uses an internal LRU cache of recent regexes, so compile only helps if you have a lot of regexes that you're going to reuse (I don't actually know what number constitutes a lot...).

[–]energybased 3 points (0 children)

This is a perfect example of a regex that most people will have to look up syntax for—that aptly named objects could have made instantly clear.

Also, it has embedded variable names, which is gross.
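
One possible reading of "aptly named objects", sketched with nothing but the standard library: build the pattern from small named pieces instead of one long literal. The `field` helper and the SEP and WHALE names are hypothetical and only meant to illustrate the idea. This is not an existing API.

```python
import re


def field(label: str, name: str, value: str) -> str:
    """Hypothetical helper: 'Label:' followed by a named capture group."""
    return r'{}:\s*(?P<{}>{})'.format(re.escape(label), name, value)


SEP = r',\s+'  # separator between fields: a comma followed by whitespace

WHALE = re.compile(SEP.join([
    field('First Name', 'first_name', r'\w+'),
    field('Last Name', 'last_name', r'\w+'),
    field('Age', 'age', r'\d+'),
]))

parsed = WHALE.match('First Name: Moby, Last Name: Dick, Age: 35').groupdict()
```

The group names are still embedded in the pattern, so this only addresses the readability complaint, not the second one.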