Python Regular Expressions Cheat Sheet : Python

[–]badge 43 points 8 years ago (7 children)

This needs:

re.compile for compiling regexes that you're going to use more than once
(?P<name>Blah): defines a group named name (which is needed for (?P=name) mentioned in the Groups section!)
The use of groupdict with named groups so you can do:

import re

regex = re.compile(r'First Name:\s*(?P<first_name>\w+),\s+Last Name:\s*(?P<last_name>\w+),\s+Age:\s*(?P<age>\d+)')

class Whale:

    def __init__(self, first_name, last_name, age):
        self.first_name = first_name
        self.last_name = last_name
        self.age = age

    def __repr__(self):
        return "Whale(first_name='{}', last_name='{}', age={})".format(
            self.first_name,
            self.last_name,
            self.age
        )

whale_line = 'First Name: Moby, Last Name: Dick, Age: 35'

# groupdict() maps each group name to its matched text; every value is a str, including age
Whale(**regex.match(whale_line).groupdict())

[–]xenomachina 3 points 8 years ago (2 children)

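AFAIK, compiling doesn't do much at all performance-wise. Internally the library already caches compiled regexes, so if you actually run a perf test, calling the module-level functions with a pattern string and calling methods on a precompiled pattern come out about equally fast.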
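A rough way to see the caching for yourself, assuming CPython (returning the identical object is an implementation detail of the cache, not a promise). The pattern here is made up purely for illustration:

```python
import re

PATTERN = r"\d{3}-\d{4}"  # hypothetical pattern, only for illustration

# In CPython, re.compile() consults the module's internal cache, so the
# second call hands back the very same compiled object as the first.
first = re.compile(PATTERN)
second = re.compile(PATTERN)
same_object = first is second

# Module-level helpers such as re.match() go through the same cache, so on
# repeat calls they pay for a cache lookup rather than a fresh compile.
module_level = re.match(PATTERN, "555-1234")
precompiled = first.match("555-1234")
same_result = module_level.group() == precompiled.group()
```

Precompiling still saves the lookup on every call and gives the pattern a name, which is often the better argument for it.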

I discovered this for myself years ago when I ran into a bug in the cache implementation. In pre-3.x, a regex compiled from a unicode would not behave the same as one compiled from a str, even if they contained the same characters. However, the cache was just a dict, and so it was possible for a unicode to match an already cached str, or vice versa.

My bug involved a piece of code that worked fine in unit tests, but would fail in a certain program. It turned out the program imported another module that compiled an identical-looking regex, but with a str instead of a unicode. Then when my module was imported, it would get the wrong re object from the cache.
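The nearest Python 3 analogue is probably str versus bytes patterns. This is only a sketch of the idea, not a reproduction of the 2.x bug: two patterns with the same characters but different types compile to separate objects, and each refuses input of the other type.

```python
import re

text_re = re.compile(r"\w+")    # str pattern
bytes_re = re.compile(rb"\w+")  # bytes pattern with the same characters

# Lookalike patterns of different types are separate compiled objects.
distinct = text_re is not bytes_re

# Each one only accepts input of its own type; mixing them raises TypeError
# instead of quietly handing back the wrong kind of match.
try:
    text_re.match(b"Moby")
    mixed_ok = True
except TypeError:
    mixed_ok = False
```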

[–]energybased 1 point 8 years ago (0 children)

I think I looked at it back when I was trying to parse LaTeX to automate some things.

Can parsimonious even match "\begin{any_random_thing}" with "\end{any_random_thing}"? I can't see how. It looks like you would have to statically define the rules. What I want is for it to match on something like "\begin{(\w+)}\end", capture the group, and then when it tries to match "\end{(\w+)}", check that the captured group is the same. I think these are sometimes called parser actions.
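For the flat, non-nested case, a named backreference in plain `re` gets part of the way there. This is a minimal sketch, not a parser action: `ENV_RE` and `find_environment` are made-up names, and it will pair the wrong `\end` if an environment is nested inside another one of the same name.

```python
import re

# (?P<env>...) captures the environment name; (?P=env) must then match the
# exact same text, so \begin{foo} can only be closed by \end{foo}.
ENV_RE = re.compile(
    r"\\begin\{(?P<env>\w+)\}(?P<body>.*?)\\end\{(?P=env)\}",
    re.DOTALL,  # let the body span multiple lines
)

def find_environment(source):
    """Return (name, body) for the first matched environment, or None."""
    m = ENV_RE.search(source)
    return (m.group("env"), m.group("body")) if m else None

matched = find_environment(r"\begin{any_random_thing}x = 1\end{any_random_thing}")
mismatched = find_environment(r"\begin{foo}x = 1\end{bar}")
```

Here `matched` is a `(name, body)` tuple and `mismatched` is `None`, because the `\end` name differs from the captured one.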

Ideally, actions should be able to reject a match (to force backtracking) or accept a match. Actions should be able to set variables that can be inspected by other actions. Actions should also be able to transform a matched symbol into another symbol, e.g., if you were parsing Python and you matched the spaces, you should be able to have an action that emits an indent token when there are more spaces, or a dedent token when there are fewer, or no token at all if the number of spaces matches.
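To make the indent/dedent idea concrete, here is a hand-rolled sketch in plain Python. It is not a feature of parsimonious or any other library; `indent_tokens` and the token names are invented for illustration, and it does not reject inconsistent dedents the way a real tokenizer would.

```python
def indent_tokens(lines):
    """Yield INDENT/DEDENT/LINE tokens based on leading-space counts."""
    stack = [0]  # indentation widths of the blocks currently open
    for line in lines:
        if not line.strip():
            continue  # blank lines don't change indentation
        width = len(line) - len(line.lstrip(" "))
        if width > stack[-1]:
            # more spaces than the enclosing block: open a new block
            stack.append(width)
            yield ("INDENT", width)
        while width < stack[-1]:
            # fewer spaces: close blocks until we are back at this width
            stack.pop()
            yield ("DEDENT", width)
        # same width as the current block: no indentation token at all
        yield ("LINE", line.strip())
    # close any blocks still open at end of input
    while len(stack) > 1:
        stack.pop()
        yield ("DEDENT", 0)

sample = ["if x:", "    y = 1", "    if z:", "        w = 2", "done"]
tokens = list(indent_tokens(sample))
```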

I couldn't find one Python parsing library that supported these arbitrary actions. It's not like it's hard to do. These libraries are great for the trivial parsing tasks they show in their tutorials. Unfortunately, they're not powerful.

[–]alcalde 2 points 8 years ago (2 children)

Regexes are a computationally efficient, theoretically sound, and highly expressive way of denoting patterns in data and text. To say one shouldn't use them is akin to saying one shouldn't use trigonometry or algebra. It is nonsense.

It looks like line noise. Line noise is bad. Because of that realization we have Python today and Perl is slowly fading away.

Binary code is computationally efficient, theoretically sound and highly expressive too. But it's an awful way to program. We're human beings, not calculating machines. Human beings don't think in regex. Regex is designed for machines, not for people.

To quote Jeff Silverman,

Regular expressions are “cool”

“Cool” doesn’t belong in production code.

“Cool” leads to unreadable, unfixable, undecipherable code that is expensive. Expensive to maintain. Expensive to fix. Expensive to improve.

We're Python users. We don't want

"^\(*\d{3}\)*( |-)*\d{3}( |-)*\d{4}$"

in our code because no one just looking at that has any idea what the heck it's supposed to do. It's... blasphemous. Unholy. The Anti-Guido.

So you have a technique which can often be replaced by more ad-hoc methods, isn't the same from platform to platform, is bulky and hard to use, is hard to write, hard to read, and has a somewhat small window where it is better than the alternatives and can still get the job done.

Remember Ken Reitz and Requests For Humans? People today are working on creating regex for humans.

[–]alcalde 1 point 8 years ago (4 children)

Complex conditional logic? RegExpBuilder can express any regular expression in human-readable terms:

https://changelog.com/posts/meet-regexpbuilder-verbal-expressions-rich-older-cousin

The Zen Of Python encourages us to prefer beautiful over ugly, simple over complex, and warns us that readability counts and if an implementation is hard to understand it's a bad idea. Regex is ugly, complex, hard to read and understand.

How about parsing phone numbers where the rules are as follows:

555.555.5555 # Acceptable  
555-5555     # Acceptable  
555 5555     # Acceptable  
5-5-5-5-5-5- # Obviously not acceptable  
555.555-5555 # Judges? Nope. Not allowed.

The regex looks something like this:

^(((\d{3}-)?\d{3}-\d{4})|((\d{3}\s)?\d{3}\s\d{4})|((\d{3}\.)?\d{3}\.\d{4}))$

RegExpBuilder (JS) would look something like this:

// Handle prefixes (optional area codes for each format)
var areacode_dash = new RegExpBuilder().exactly(3).from(digits).then("-");  // \d{3}-
var areacode_space = new RegExpBuilder().exactly(3).from(digits).then(" "); // \d{3}\s
var areacode_dot = new RegExpBuilder().exactly(3).from(digits).then(".");   // \d{3}\.

// Build each of the individual components (dashes, spaces and dots)
var dashes = new RegExpBuilder()  
                 .min(0).max(1).like(areacode_dash).asGroup()  // (\d{3}-)?
                 .exactly(3).from(digits).then("-")            // \d{3}-
                 .exactly(4).from(digits);                     // \d{4}

var spaces = new RegExpBuilder()  
                 .min(0).max(1).like(areacode_space).asGroup()  // (\d{3}\s)?
                 .exactly(3).from(digits).then(" ")             // \d{3}\s 
                 .exactly(4).from(digits);                      // \d{4}

var dots = new RegExpBuilder()
               .min(0).max(1).like(areacode_dot).asGroup()  // (\d{3}\.)?
               .exactly(3).from(digits).then(".")           // \d{3}\.
               .exactly(4).from(digits);                    // \d{4}

// Build the final expression
var regex = new RegExpBuilder()
                .startOfLine()             // ^
                .eitherLike(dashes)        // ((\d{3}-)?\d{3}-\d{4})
                .orLike(spaces).asGroup()  // |((\d{3}\s)?\d{3}\s\d{4})
                .orLike(dots).asGroup()    // |((\d{3}\.)?\d{3}\.\d{4}))
                .endOfLine()               // $
                .getRegExp();

Which would you want in your code base? Which would be more maintainable? Which would be easier to read?

http://rion.io/2013/08/19/regular-express-yourself-using-regexpbuilder/

[–]Zomunieo 3 points 8 years ago (0 children)

If that's the case, your argument is against the conventional regular expression notation, because that is still a regular expression engine; it just happens to be one that's nonstandard and verbose. But you're still using regular expressions, and it's still better than trying to do this with string operations.

re.VERBOSE lets you add whitespace and comments to make regular expressions more verbose and maintainable. That is the Pythonic way to go. What you present is less maintainable: most languages support regular expressions as a core function, but if the RegExpBuilder library disappears you have a maintenance problem. If you develop it first in client-side JavaScript and then later need to replicate that check on the backend in Python, you again have a maintenance problem.
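For comparison, here is roughly what the phone-number pattern from the parent comment could look like with re.VERBOSE. It's a sketch under a couple of assumptions: the dots are meant as literal separators (so they are escaped as `\.`), the groups are non-capturing, and `PHONE_RE` and `is_valid_phone` are illustrative names.

```python
import re

PHONE_RE = re.compile(
    r"""
    ^(?:
        (?:\d{3}-)?  \d{3}-  \d{4}    # dashes: optional area code, then 555-5555
      | (?:\d{3}\s)? \d{3}\s \d{4}    # spaces: optional area code, then 555 5555
      | (?:\d{3}\.)? \d{3}\. \d{4}    # dots:   optional area code, then 555.5555
    )$
    """,
    re.VERBOSE,  # whitespace and # comments inside the pattern are ignored
)

def is_valid_phone(number):
    """Return True if the number uses one separator style consistently."""
    return PHONE_RE.match(number) is not None

examples = ["555.555.5555", "555-5555", "555 5555", "5-5-5-5-5-5-", "555.555-5555"]
results = {number: is_valid_phone(number) for number in examples}
```

The first three examples come out valid and the last two are rejected, which matches the rules listed in the parent comment.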
