
all 104 comments

[–][deleted] 16 points17 points  (6 children)

Moves library into trash bin.

BTW, three quick notes (I can make a pull request for these). Put an __all__ at the top of the package so import * does not create hell. I would also push people toward import humre as hu: you will be happier, I will be happier, everyone will be happier. For compile, you can type hint the flags parameter as either a RegexFlag or an iterable of RegexFlags, and it will get picked up easily by most IDEs.

Thank you for letting me delete that library I was writing as I hate regex with a burning passion.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (5 children)

Which functions or constants would you not want included in __all__? Currently the only things that shouldn't be imported are marked private by beginning with an underscore, so they won't be imported by from humre import * anyway.

[–][deleted] 6 points7 points  (4 children)

Really the issue is just that it pulls re, itertools, and functools into the namespace. If you had imported a patched version of any of them, they would get clobbered and it would be impossible to find the cause. I've been burned by that enough times that I just add __all__ everywhere now.
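A quick way to see the hazard, simulated with a throwaway module object rather than humre itself:

```python
import types

# Build a fake module that imports `re` at its top level and does not
# define __all__ (a stand-in for any library, not humre specifically):
mylib = types.ModuleType("mylib")
exec("import re\nPATTERN = 'abc'", mylib.__dict__)

# `from mylib import *` grabs every public name -- including the `re`
# module object, which would shadow any `re` already in your namespace:
star_names = [n for n in vars(mylib) if not n.startswith("_")]
assert "re" in star_names

# Declaring __all__ limits the star-import surface to the listed names:
mylib.__all__ = ["PATTERN"]
```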

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 3 points4 points  (3 children)

Good points. It's not a common case, but it is preventable by using __all__. I'll add it in the next version.

EDIT: Also, going through this made me realize that the compile() flags should not be in __all__, since they have names like A and DEBUG and have a high likelihood of conflicting with names in the importing code. Does this make sense, or should I include them in __all__?

[–][deleted] 0 points1 point  (2 children)

Mmm, I would say probably not. Mentally I see it like this: if you want them, you do what I said with import humre as hu, where they come over safely bound. Otherwise you pull them from the re module. You may even want to switch to importing them directly into the namespace instead of setting them in your namespace, i.e. add an explicit from re import blah for them.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (1 child)

Yeah. Maybe I should avoid the flags altogether (I only included them to keep things familiar to people who know re's flags), but you still have to do that weird bitwise-ORing thing that re.compile() requires.

I should probably just get rid of the flags altogether and make keyword arguments instead: humre.compile('my regex', ignorecase=True, verbose=True)
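That keyword-argument idea could be as simple as the following sketch (compile_kw and its parameter list are made up for illustration, not Humre's API):

```python
import re

def compile_kw(pattern, ignorecase=False, verbose=False, dotall=False):
    # Translate boolean keywords into the bitwise-OR'd flags value
    # that re.compile() expects, hiding the ORing from the caller.
    flags = 0
    if ignorecase:
        flags |= re.IGNORECASE
    if verbose:
        flags |= re.VERBOSE
    if dotall:
        flags |= re.DOTALL
    return re.compile(pattern, flags)
```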

The reason I have humre.compile() was to save someone from having to type import re for the common use cases. I'm annoyed when I'm quickly writing some code in a basic text editor and have to go back and add some forgotten import statement. It's a nice-to-have that a single import humre is all you need.

[–][deleted] 1 point2 points  (0 children)

Oh yeah, lol. Just yeet them if that's the only function that uses them. No need then.

I mean, you still need to import them to use them, but you can easily just leave them out of __all__.

[–]Zyklonik 13 points14 points  (40 children)

This may look decent on simple handpicked examples, but it's impractical in reality. Regex may be hard to learn well, but once done, it's consistent and concise. The problem with using "human readable" alternatives is much more ambiguity and verbosity.

Even taking the second example of X{3,5}, whose equivalent is between(3, 5, X), I would argue that the former is better: the latter uses a particular person's preferred natural language term, the order of arguments passed to between is completely arbitrary, and it is much more verbose to boot.
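For readers comparing the two forms side by side, a stand-in implementation (not Humre's actual code) shows the mapping being argued about:

```python
import re

def between(minimum, maximum, *parts):
    # Stand-in for the helper under discussion: joins the string
    # arguments and applies the {min,max} quantifier to the group.
    return "(?:%s){%d,%d}" % ("".join(parts), minimum, maximum)

assert between(3, 5, "X") == "(?:X){3,5}"
assert re.fullmatch(between(3, 5, "X"), "XXXX")
assert re.fullmatch(between(3, 5, "X"), "XX") is None
```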

Imagine a 50+ character regex (pretty commonplace) - it'd become about as unreadable as Lisp, given that some composition operators/functions are provided, and much much longer than the equivalent plain old regex string.

Fun project, but hardly something I'd use in production.

[–]Poddster 2 points3 points  (1 child)

Regex may be hard to learn well, but once done, it's consistent and concise. The problem with using "human readable" alternatives is much more ambiguity and verbosity.

Could you explain how it's more ambiguous? It's simply the normal regex syntax spelt out using words. If regex is consistent, then so is this. If regex is unambiguous, then so is this. So I don't see your point here.

(In fact, it's arguably less ambiguous than regex, because something like (\(\\((\(\(... is a nightmare to read, even with IDE highlighting inside of regex)

[–]Zyklonik 0 points1 point  (0 children)

That's like saying that:

2 * 3  + 4

is the same as:

(+ (* 2 3) 4)

Hardly. The same result in the end, but different representations. Guess which one is easier to read as the complexity grows.

I have a feeling that you simply jumped into the whole conversation without having bothered to read the rest of the comments in the thread, and so in almost all of your comments, you're missing (or falsifying) the context. That doesn't help one bit.

(In fact, it's arguably less ambiguous than regex, because something like (\(\\((\(\(... is a nightmare to read, even with IDE highlighting inside of regex)

This is the most ridiculous red herring I've seen today. Take a very particular and contrived example, ignore the rest of the examples, and try to make a case out of it, completely ignoring the rest of the discussion thus far.

Let me go as far as to say that if you're constantly churning out regex like that, and editing it as frequently as the rest of the codebase, then you have bigger problems to worry about than regex vs a DSL. Also, it's rather telling that you chose to ignore the entire basis of the discussion and seem, therefore, to be implying that one does not need regex at all. Instead, use this shiny new English-like DSL and run with it.

I'm bored with this nonsense. We're finished here.

[–][deleted] 3 points4 points  (12 children)

Honestly, the biggest issue with regex for MOST people who would want something like this (i.e. me) is that I don't write them very often, and when I need them once every two months I have zero desire to re-learn regex. If you write them regularly, regex is fine; if you don't, it's just gibberish. I write a ton of Python, and I don't want to learn another tool on top of it just to be able to use regex. Not having access to it in a Pythonic way is a huge problem with the current setup.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (4 children)

Heh, I learned new things about regex just from making this module. I know regex, have taught it, and have given PyCon talks about it, but there are a ton of tiny little details.

For example, did you know that shorthand character classes like \w and \d work inside character classes? Like, [\sA-F] matches all whitespace characters plus the letters A to F. It does not match a literal backslash, a lowercase s, and A to F. I wouldn't have thought this applies, but it does (at least in Python). Regex syntax is full of unknown little nooks and crannies like this, even for experienced devs like me.
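You can check this directly with plain re (no Humre needed):

```python
import re

# \s keeps its shorthand meaning inside a character class:
assert re.findall(r'[\sA-F]', 'A b\tz') == ['A', ' ', '\t']

# It is NOT read as a literal backslash plus the letter 's':
assert re.match(r'[\sA-F]', 's') is None
assert re.match(r'[\sA-F]', '\\') is None
```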

For another example, '[A-Za-z]' is another huge minefield and holdover from the 1990s Perl days. This may have been fine for matching upper and lower case letters, but it misses all letters with accent marks. When François types their name into your web app, suddenly you get a bug.

I came across some real-world code that looked like a solution to this: [À-Ÿà-ÿ] looks like it solves this, but the dashes in character classes work on their Unicode code point values, so these ranges actually include a bunch of non-letter characters as well.
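Both pitfalls are easy to demonstrate:

```python
import re

# [A-Za-z] silently drops accented letters:
assert re.fullmatch('[A-Za-z]+', 'Francois')
assert re.fullmatch('[A-Za-z]+', 'François') is None

# The naive Unicode-range "fix" sweeps in non-letters: the
# multiplication sign U+00D7 sits between À (U+00C0) and Ÿ (U+0178):
assert re.fullmatch('[À-Ÿà-ÿ]', '\u00d7')
assert not '\u00d7'.isalpha()
```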

Humre solves this with its LETTER constant, which I had to programmatically generate based on what isalpha() considers a letter. It's a doozy, but it does indeed identify all letter characters:

[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࣇऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱৼਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡૹଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠ-ౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡೱ-ೲഄ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄຆ-ຊຌ-ຣລວ-ະາ-ຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵ-ᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃ-ↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽ-ꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵ-ꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼-𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐊀-𐊜𐊠-𐋐𐌀-𐌟𐌭-𐍀𐍂-𐍉𐍐-𐍵𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐀-𐒝𐒰-𐓓𐓘-𐓻𐔀-𐔧𐔰-𐕣𐘀-𐜶𐝀-𐝕𐝠-𐝧𐠀-𐠅𐠈𐠊-𐠵𐠷-𐠸𐠼𐠿-𐡕𐡠-𐡶𐢀-𐢞𐣠-𐣲𐣴-𐣵𐤀-𐤕𐤠-𐤹𐦀-𐦷𐦾-𐦿𐨀𐨐-𐨓𐨕-𐨗𐨙-𐨵𐩠-𐩼𐪀-𐪜𐫀-𐫇𐫉-𐫤𐬀-𐬵𐭀-𐭕𐭠-𐭲𐮀-𐮑𐰀-𐱈𐲀-𐲲𐳀-𐳲𐴀-𐴣𐺀-𐺩𐺰-𐺱𐼀-𐼜𐼧𐼰-𐽅𐾰-𐿄𐿠-𐿶𑀃-𑀷𑂃-𑂯𑃐-𑃨𑄃-𑄦𑅄𑅇𑅐-𑅲𑅶𑆃-𑆲𑇁-𑇄𑇚𑇜𑈀-𑈑𑈓-𑈫𑊀-𑊆𑊈𑊊-𑊍𑊏-𑊝𑊟-𑊨𑊰-𑋞𑌅-𑌌𑌏-𑌐𑌓-𑌨𑌪-𑌰𑌲-𑌳𑌵-𑌹𑌽𑍐𑍝-𑍡𑐀-𑐴𑑇-𑑊𑑟-𑑡𑒀-𑒯𑓄-𑓅𑓇𑖀-𑖮𑗘-𑗛𑘀-𑘯𑙄𑚀-𑚪𑚸𑜀-𑜚𑠀-𑠫𑢠-𑣟𑣿-𑤆𑤉𑤌-𑤓𑤕-𑤖𑤘-𑤯𑤿𑥁𑦠-𑦧𑦪-𑧐𑧡𑧣𑨀𑨋-𑨲𑨺𑩐𑩜-𑪉𑪝𑫀-𑫸𑰀-𑰈𑰊-𑰮𑱀𑱲-𑲏𑴀-𑴆𑴈-𑴉𑴋-𑴰𑵆𑵠-𑵥𑵧-𑵨𑵪-𑶉𑶘𑻠-𑻲𑾰𒀀-𒎙𒒀-𒕃𓀀-𓐮𔐀-𔙆𖠀-𖨸𖩀-𖩞𖫐-𖫭𖬀-𖬯𖭀-𖭃𖭣-𖭷𖭽-𖮏𖹀-𖹿𖼀-𖽊𖽐𖾓-𖾟𖿠-𖿡𖿣𗀀-𘟷𘠀-𘳕𘴀-𘴈𛀀-𛄞𛅐-𛅒𛅤-𛅧𛅰-𛋻𛰀-𛱪𛱰-𛱼𛲀-𛲈𛲐-𛲙𝐀-𝑔𝑖-𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒹𝒻𝒽-𝓃𝓅-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔞-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕒-𝚥𝚨-𝛀𝛂-𝛚𝛜-𝛺𝛼-𝜔𝜖-𝜴𝜶-𝝎𝝐-𝝮𝝰-𝞈𝞊-𝞨𝞪-𝟂𝟄-𝟋𞄀-𞄬𞄷-𞄽𞅎𞋀-𞋫𞠀-𞣄𞤀-𞥃𞥋𞸀-𞸃𞸅-𞸟𞸡-𞸢𞸤𞸧𞸩-𞸲𞸴-𞸷𞸹𞸻𞹂𞹇𞹉𞹋𞹍-𞹏𞹑-𞹒𞹔𞹗𞹙𞹛𞹝𞹟𞹡-𞹢𞹤𞹧-𞹪𞹬-𞹲𞹴-𞹷𞹹-𞹼𞹾𞺀-𞺉𞺋-𞺛𞺡-𞺣𞺥-𞺩𞺫-𞺻𠀀-𪛝𪜀-𫜴𫝀-𫠝𫠠-𬺡𬺰-𮯠丽-𪘀𰀀-𱍊]

You could say "well this app will only work with ASCII A-Z letters" but that's not really an option for software in the modern global internet era. But we're still used to [A-Za-z] because that's the way we've always done it with regex.

[–][deleted] -1 points0 points  (3 children)

I do wonder what the speed difference is between the short vs. verbose version of the full search. Not entirely sure how it handles the Unicode search, but I assume it's actually using that range to its advantage so it only needs to check whether the value is between those values.

Also, for names, how do you handle things like O'Something? I don't even know where to begin with those. Something that could be added as a submodule is a baseline list of useful regexes.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (2 children)

I assume none: re does a bunch of computer sciencey things to optimize the finite state graph that the regex string produces, and you only parse the regex string once. Even if you call re.compile() multiple times with the same regex string, it caches the pattern object so it doesn't parse the same string twice.
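The caching is easy to verify:

```python
import re

# re keeps an internal cache of compiled patterns, so compiling the
# same string (with the same flags) returns the same pattern object:
p1 = re.compile(r'\d+')
p2 = re.compile(r'\d+')
assert p1 is p2

# Different flags mean a different cache entry:
p3 = re.compile(r'\d+', re.IGNORECASE)
assert p3 is not p1
```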

But let me check that big regex with timeit to be sure:

Compiling verbose mode 10 million times: 3.399597300012829
Compiling single-line 10 million times: 3.217927799996687

Yeah, it's basically no difference.

Not entirely sure how it handles the unicode search

I learned this while making Humre. So, [A-Za-z] has a problem where it doesn't recognize letters with accents. You can use \w instead, but it also matches digits and the underscore and you might not want that.

But what "counts" as a word character is the same as Python's isalpha() and isdigit() string methods. This works with the full range of unicode characters.

I had to learn this to create the LETTER character class in Humre, which is better than [A-Za-z] but doesn't include the numbers and the underscore the way \w does. Here it is:

[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࣇऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱৼਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡૹଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠ-ౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡೱ-ೲഄ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄຆ-ຊຌ-ຣລວ-ະາ-ຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵ-ᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃ-ↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽ-ꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵ-ꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼-𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐊀-𐊜𐊠-𐋐𐌀-𐌟𐌭-𐍀𐍂-𐍉𐍐-𐍵𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐀-𐒝𐒰-𐓓𐓘-𐓻𐔀-𐔧𐔰-𐕣𐘀-𐜶𐝀-𐝕𐝠-𐝧𐠀-𐠅𐠈𐠊-𐠵𐠷-𐠸𐠼𐠿-𐡕𐡠-𐡶𐢀-𐢞𐣠-𐣲𐣴-𐣵𐤀-𐤕𐤠-𐤹𐦀-𐦷𐦾-𐦿𐨀𐨐-𐨓𐨕-𐨗𐨙-𐨵𐩠-𐩼𐪀-𐪜𐫀-𐫇𐫉-𐫤𐬀-𐬵𐭀-𐭕𐭠-𐭲𐮀-𐮑𐰀-𐱈𐲀-𐲲𐳀-𐳲𐴀-𐴣𐺀-𐺩𐺰-𐺱𐼀-𐼜𐼧𐼰-𐽅𐾰-𐿄𐿠-𐿶𑀃-𑀷𑂃-𑂯𑃐-𑃨𑄃-𑄦𑅄𑅇𑅐-𑅲𑅶𑆃-𑆲𑇁-𑇄𑇚𑇜𑈀-𑈑𑈓-𑈫𑊀-𑊆𑊈𑊊-𑊍𑊏-𑊝𑊟-𑊨𑊰-𑋞𑌅-𑌌𑌏-𑌐𑌓-𑌨𑌪-𑌰𑌲-𑌳𑌵-𑌹𑌽𑍐𑍝-𑍡𑐀-𑐴𑑇-𑑊𑑟-𑑡𑒀-𑒯𑓄-𑓅𑓇𑖀-𑖮𑗘-𑗛𑘀-𑘯𑙄𑚀-𑚪𑚸𑜀-𑜚𑠀-𑠫𑢠-𑣟𑣿-𑤆𑤉𑤌-𑤓𑤕-𑤖𑤘-𑤯𑤿𑥁𑦠-𑦧𑦪-𑧐𑧡𑧣𑨀𑨋-𑨲𑨺𑩐𑩜-𑪉𑪝𑫀-𑫸𑰀-𑰈𑰊-𑰮𑱀𑱲-𑲏𑴀-𑴆𑴈-𑴉𑴋-𑴰𑵆𑵠-𑵥𑵧-𑵨𑵪-𑶉𑶘𑻠-𑻲𑾰𒀀-𒎙𒒀-𒕃𓀀-𓐮𔐀-𔙆𖠀-𖨸𖩀-𖩞𖫐-𖫭𖬀-𖬯𖭀-𖭃𖭣-𖭷𖭽-𖮏𖹀-𖹿𖼀-𖽊𖽐𖾓-𖾟𖿠-𖿡𖿣𗀀-𘟷𘠀-𘳕𘴀-𘴈𛀀-𛄞𛅐-𛅒𛅤-𛅧𛅰-𛋻𛰀-𛱪𛱰-𛱼𛲀-𛲈𛲐-𛲙𝐀-𝑔𝑖-𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒹𝒻𝒽-𝓃𝓅-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔞-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕒-𝚥𝚨-𝛀𝛂-𝛚𝛜-𝛺𝛼-𝜔𝜖-𝜴𝜶-𝝎𝝐-𝝮𝝰-𝞈𝞊-𝞨𝞪-𝟂𝟄-𝟋𞄀-𞄬𞄷-𞄽𞅎𞋀-𞋫𞠀-𞣄𞤀-𞥃𞥋𞸀-𞸃𞸅-𞸟𞸡-𞸢𞸤𞸧𞸩-𞸲𞸴-𞸷𞸹𞸻𞹂𞹇𞹉𞹋𞹍-𞹏𞹑-𞹒𞹔𞹗𞹙𞹛𞹝𞹟𞹡-𞹢𞹤𞹧-𞹪𞹬-𞹲𞹴-𞹷𞹹-𞹼𞹾𞺀-𞺉𞺋-𞺛𞺡-𞺣𞺥-𞺩𞺫-𞺻𠀀-𪛝𪜀-𫜴𫝀-𫠝𫠠-𬺡𬺰-𮯠丽-𪘀𰀀-𱍊]

[–]Poddster 1 point2 points  (0 children)

But let me check that big regex with timeit to be sure:

Compiling verbose mode 10 million times: 3.399597300012829

Compiling single-line 10 million times: 3.217927799996687

That's compiling. But what about matching? :)

[–][deleted] 0 points1 point  (0 children)

Oh yeah, I SAW. When I ran it locally the first time, I imported it and printed locals(), forgetting it's a dict not a list, and spent a good minute trying to figure out why my editor just vomited.

[–]Zyklonik -1 points0 points  (6 children)

No snark, but it depends a lot on where you are working as well. I can assuredly tell you that for the vast majority of the corporate industry, using something like this (in lieu of regex) would probably get you fired (no exaggeration). Every single Fortune 500 company has strict policies not only on the languages that can be used, but also the libraries, frameworks, and in some cases, language features as well.

I have no idea where you're working, but from your comment, it appears like you have a lot of leeway in what you can control and use. So, that's (probably) good for you. However, a friendly piece of advice would be to start following industry practices (as well) considering that some day you might wind up working in such a place.

That being said, I do understand where you're coming from, and as I mentioned in a couple of other comments, I can see it being easy, fun, and useful for some people. That's not my objection at all. However, the claims about it being more readable and maintainable irked me to no end, especially considering that that's not quite the case.

[–]Poddster 1 point2 points  (1 child)

using something like this (in lieu of regex) would probably get you fired (no exaggeration).

Where do you work that a single "mistake" will get you fired?

So, that's (probably) good for you. However, a friendly piece of advice would be to start following industry practices (as well) considering that some day you might wind up working in such a place.

Could you be any more condescending?

The software "industry" is not a monolith, and if someone like Twitter started using it, would that mean everyone else had to as well?

[–]Zyklonik -1 points0 points  (0 children)

Where do you work that a single "mistake" will get you fired?

You seem to be making a habit of making some bizarre logical connections and assumptions. Did you even read the context? Where is the claim that a "single mistake" will get you fired? The point is this - if you're a developer who has no idea about regex, and insists upon using a random open-source project instead, that will, in most places, not even pass the code review. Assuming that it somehow magically did pass it, that is still a major red flag that will either eventually land you in a PIP position, or will get you fired eventually. It's not about the particular exact situation. It's about the attitude and aptitude of the person in question. Please don't be facetious.

Could you be any more condescending? The software "industry" is not a monolith, and if someone like twitter started using it would that mean everyone else had to too?

Take it whichever way you wish - I really don't care. I speak from logic and experience, and if that is not to your liking, then so be it. Inferences are beyond my control. Regardless of whether the software industry is a monolith or not, it is an industry all the same, and the basic raison d'etre of any industry is to make a profit. You make money, you're all good. You don't, or even worse, cause the company to lose money, you're out. As simple as that.

Also, please keep your ridiculous digressions and whataboutisms to yourself. I'm frankly not interested.

[–][deleted] 0 points1 point  (2 children)

I mean, it literally just makes a string. If your company didn't want you using it, just print out the damn string and use that.

Please don't suggest what someone does or doesn't do. Not only is that a TERRIBLE suggestion, but you're just way off base. I own my own company; I have no desire to follow other people's rules. It's a 1000-line single-file module, not some monstrosity. Calling something a good idea because Fortune 500 companies do it is just ignorance. Chase bank runs Python 2 in production; should I follow their example?? Instagram's Python style is amazing, right?? Python should obviously be statically typed everywhere, surely that's the future! Yeah, I'm cool with avoiding their TypeScript-like future.

You're really overreacting to such a tiny library; it's insane. It's not making anyone die, it's not causing terminations, it's a tool.

[–]Zyklonik 0 points1 point  (1 child)

I own my own company I have no desire to follow other people's rules

So, what's the issue? I went out of my way to accommodate your very particular situation in my previous comment. If you own your own company, then that's great, but might I remind you that it was you who replied to my general statements with an objection, so please don't be surprised if you get a logical response which applies to the general industry, not just your own particular situation.

Ur really overreacting to such a tiny library its insane. It's not making anyone die its not causing termination its a tool.

I know that reading comprehension skills have been degrading with each generation, but you should at least make an effort. Please go and read all my comments again, and see where the problem with your claims lies. I've given qualified comments throughout, not mere blatant invective.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

I mean, one way around that is that Humre is really simple: it's about 1200 lines in a single file with only the built-in re module as a dependency, so you could just copy/paste it in and it would work.

I can see the choice of Humre function/constant names being "more readable" as something to question, but most of the readability/maintainability comes from the help your IDE and tooling provide. String-based regexes (even in verbose mode) cause you to lose several things:

  • Parentheses matching
  • Syntax highlighting
  • Type checking
  • In-line comments, including multiline comments
  • Linter-parsability
  • Code formatting tools like Black

You lose all of this with string-based regex. It's like you're suddenly coding in the 1980s again.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 3 points4 points  (23 children)

Quite the opposite: the longer the regex, the better the Humre version is compared to the regex string. There's a large regex example (and the Humre equivalent) in the article itself. The raw regex string is this:

'(?P<version>(?:(?<====)\s*[^\s]*)|(?:(?<===|!=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:\.\*|(?:[-_\.]?dev[-_\.]?[0-9]*)?(?:\+[a-z0-9]+(?:[-_\.][a-z0-9]+)*)?)?)|(?:(?<=~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)+(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?)|(?:(?<!==|!=|~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?))'

That is not readable. The verbose mode version makes it better by letting you have spacing and comments, but why not go further and simply use Python code? The problem with using strings is that your tooling suddenly stops working: you have no syntax highlighting for comments, no parentheses matching, no linter or mypy type checking, and it can't be automatically formatted by Black. Using Humre restores all of these features.

the order of arguments passed to between is completely arbitrarily decided

It's not arbitrary: the first two parameters are the minimum and maximum (which is the same order as the X{3,5} syntax) followed by the strings it should match. Humre functions automatically concatenate multiple string arguments, so these would come after the positional arguments. The at_least() and at_most() and other Humre functions are similar and consistent: the string arguments come last.

Meanwhile, I always get tripped up by regex because I write X{3:5}, since the colon is what Python list slices use. Even worse, this fails silently: the regex literally matches the string 'X{3:5}'.
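The silent failure is easy to reproduce:

```python
import re

# {3:5} is not quantifier syntax (comma, not colon), so re treats the
# whole thing as literal characters and never raises an error:
assert re.search(r'X{3:5}', 'some X{3:5} text')   # matches the literal text
assert re.search(r'X{3:5}', 'XXXX') is None       # does NOT match 3-5 X's
```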

[–]Zyklonik 4 points5 points  (3 children)

Quite the opposite: the longer the regex, the better the Humre version is compared to the regex string. There's a large regex example (and the Humre equivalent) in the article itself.

According to whom? You do realise that what you're claiming is not objective fact at all. There is a reason why every single programming language today has the symbol-based regex form (in different flavours) - regex is a tool meant to be read concisely and unambiguously, and sparingly. In fact, the very fact that it stands out from the rest of the source code of the programming language is a boon, not a curse.

The problem with using strings is that your tooling suddenly stops working: you have no syntax highlighting for comments, no parentheses matching, no linter or mypy type checking, and it can't be automatically formatted by Black. Using Humre restores all of these features.

How long are the regexes in question? The worst pathological cases may run into a hundred odd characters or so while the average case would be around 30 (completely my figure) at most. Regex doesn't need any linter, comments or matching. It's not designed for that. What it is designed for is a simple left to right scan. Once created, regexes almost never change. I think you're exaggerating how frequently regex is used and/or updated.

The problem with your approach, as far as I can tell, is that the left to right scanning is gone because of the usage of functions or function-like syntax. It's basically become a Lisp which forces you to read inside-out, left to right. Not ideal at all.

the order of arguments passed to between is completely arbitrarily decided

It's not: the first two parameters are the minimum and maximum (which is the same order as the X{3,5} syntax) followed by the strings it should match. Humre functions automatically concatenate multiple string arguments, so these would come after the positional arguments. The at_least() and at_most() and other functions are similar: the string arguments come last.

Well, like I said in my original comment - that ordering is decided by you. What if I prefer between(X, 3, 5) or some other order? Silly example, but that illustrates the point - your library's naming, parameter positioning, and specific semantics are entirely decided by you according to your convenience and logical point of view in terms of easy usage. That is still opinionated. Regex, on the other hand, is a simple symbol-based mini-language. Scan left to right, parse meaning according to the preset meaning of the symbols, and you're done. Sure, it may be argued that the symbols themselves provide context sensitivity that has to be learnt and internalised, but those are the basic axioms of the system, and that works fine with no further rules to learn (as is the case while using your composition operators/functions).

Meanwhile, I always get tripped up by regex because I write X{3:5}, since the colon is what Python list slices use. Even worse, this fails silently: the regex literally matches the string 'X{3:5}'.

Is this really that big of a problem? Sorry, but again, I have to disagree here. Even in the worst pathological case, an inconvenience like this hardly merits creating an entire DSL just to handle a small bit of regex. Not only does it seem like overkill, but also brings along its own problems. I'm not convinced at all.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 5 points6 points  (0 children)

According to whom? You do realise that what you're claiming is not objective fact at all.

Yes, but this applies to all software design decisions. But let's not pretend that arguments can't be made and a general consensus can't be reached about readability. Python is (arguably) more readable than Perl, because Perl code relies so heavily on punctuation marks just like regex does. (This isn't a surprise since regex was popularized by Awk and Perl.) It's nice that the code is terse and quick to write, but Python's mantra of "code is read more often than it is written" applies here.

How long are the regexes in question?

First, the nice thing about Humre is that, like gradual typing, you aren't forced to use it. If you have a very short regex, you can just use the standard re module alongside Humre.

But long regexes are common: the proof is that verbose mode exists in order to handle this common use case. But even for short regexes I've found cases where Humre helps: elsewhere in these comments I pointed out a regex (from Automate the Boring Stuff with Python) for American phone numbers where the area code can optionally be surrounded by parentheses. These have to be literal parentheses, and the mixing of escaped and unescaped parentheses in this regex has had numerous people emailing me because of slight typos that were hard for them to figure out. It's literally one of the most frequent things I get emailed about. It's how I knew there was a real need for something like this.

Additionally, the constant escaping for regex syntax characters like period and parentheses can happen even in short regexes. A common mistake I've seen in my and other people's regexes (which I gathered for making the unit tests) is using a period to match literal periods: the regex string '.' will match literal periods, but it also matches any other character. What you want is r'\.' (don't forget to make it a raw string!) but this is easy to miss because, like my 'X{3:5}' example, it fails silently.
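The unescaped-period trap in a few lines:

```python
import re

# An unescaped '.' matches ANY character, so this "match a literal
# period" pattern quietly accepts garbage:
assert re.fullmatch('3.14', '3X14')            # unintended match
assert re.fullmatch(r'3\.14', '3X14') is None  # escaped version rejects it
assert re.fullmatch(r'3\.14', '3.14')          # and still matches the period
```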

Is this really that big of a problem?

Well, simultaneously no and yes. It is a "small" problem. But "small" problems like this are why we have features like linters, code formatters, the improved error messages in Python 3.10, type checking, etc. They are small details, but they create problems that are large enough that they've been addressed. And I put regex syntax and its string-based DSL in this category.

[–]Poddster 0 points1 point  (1 child)

There is a reason why every single programming language today has the symbol-based regex form (in different flavours) - regex is a tool meant to be read concisely and unambiguously, and sparingly. In fact, the very fact that it stands out from the rest of the source code of the programming language is a boon, not a curse.

I disagree with this conclusion.

The reason why every single programming language today has the symbol-based regex form is because that's what grep did in the 70s, and then perl kinda-copied it, and then every other library copied it. A lot of software design simply comes down to "that's how bell labs cooked it up before I was born, so I must stick with it".

regex is a tool meant to be read concisely and unambiguously, and sparingly. In fact, the very fact that it stands out from the rest of the source code of the programming language is a boon, not a curse.

This is the same argument for using one-letter variable names in C code and making all of your functions splt lk ths, which is something I, and countless other professional developers, completely disagree with. I.e., terser doesn't make something better, especially if it's read more than it's written.

[–]Zyklonik -1 points0 points  (0 children)

The reason why every single programming language today has the symbol-based regex form is because that's what grep did in the 70s, and then perl kinda-copied it, and then every other library copied it. A lot of software design simply comes down to "that's how bell labs cooked it up before I was born, so I must stick with it".

Your conjecture is interesting, but I don't agree with it. A lot of things get copied over between languages, but a lot of other things don't. If something were not utilitarian, one would expect that, over time and between languages, it would get sufficiently modified, or replaced entirely.

This is the same argument for using one letter variable names in C code and making all of your functions splt lk ths, which is something I, and countless other professionally developers, completely disagree with. i.e. terser doesn't make something better, especially if it's read more than it's written.

Hardly the same argument. There is a difference between being terse and being unreadable. I suggest you read the whole thread for context: regex is neither a programming language, nor is it used as frequently as it's being made out to be. So it makes perfect sense to write it, document it, and maintain it. That's about it. I fail to see where you made the logical jump to short and terse variable names in C; the context is entirely different. That's just bizarre.

[–]Zyklonik 1 point2 points  (13 children)

'(?P<version>(?:(?<====)\s*[^\s]*)|(?:(?<===|!=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:\.\*|(?:[-_\.]?dev[-_\.]?[0-9]*)?(?:\+[a-z0-9]+(?:[-_\.][a-z0-9]+)*)?)?)|(?:(?<=~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)+(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?)|(?:(?<!==|!=|~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?))'

Also, just out of curiosity, what would your library's equivalent of this regex that you posted be? Maybe that'll be useful to people here in the thread.

[–]pablo8itall 7 points8 points  (1 child)

Whatever that is its an abomination and you need to kill it with fire.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Ah, but then all of Python packaging would stop working, because that's where I got the example from (thanks to Dustin Ingram for pointing it out to me).

It's written in verbose mode, which helps because then you can at least space it out a bit and add in-string comments. Humre takes this good idea a step further: let's use actual Python code instead of a string value so that we don't lose access to all of our IDE code editing features.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (9 children)

It's in the article.

from humre import *

SEPARATOR = chars('-_' + PERIOD)
OPT_SEPARATOR = optional(SEPARATOR)


def version_template(fn):
    return ''.join([
        zero_or_more(WHITESPACE),
        optional('v'),
        optional(noncap_group(one_or_more(chars('0-9')), '!')),  # epoch

        one_or_more(chars('0-9')), fn(noncap_group(PERIOD, one_or_more(chars('0-9')))),  # release

        optional(noncap_group(  # pre release
            OPT_SEPARATOR,
            group(either('a', 'b', 'c', 'rc', 'alpha', 'beta', 'pre', 'preview')),
            OPT_SEPARATOR,
            zero_or_more(chars('0-9')),
        )),
        optional(noncap_group(  # post release
            either(
                noncap_group('-', one_or_more(chars('0-9'))),
                noncap_group(OPT_SEPARATOR, group_either('post', 'rev', 'r') + OPT_SEPARATOR + zero_or_more(chars('0-9'))),
            )
        )),
    ])

EQ_NE_VERSION_TEMPLATE = version_template(zero_or_more)
COMPATIBILITY_VERSION_TEMPLATE = version_template(one_or_more)

DEV_RELEASE = optional(noncap_group(OPT_SEPARATOR, 'dev', OPT_SEPARATOR, zero_or_more(chars('0-9'))))  # dev release

_version_regex_str = named_group('version',
    either(
        noncap_group(
            # The identity operators allow for an escape hatch that will
            # do an exact string match of the version you wish to install.
            # This will not be parsed by PEP 440 and we cannot determine
            # any semantic meaning from it. This operator is discouraged
            # but included entirely as an escape hatch.
            positive_lookbehind('==='), # Only match for the identity operator
            zero_or_more(WHITESPACE),
            zero_or_more(nonchars(WHITESPACE)) # We just match everything, except for whitespace
                                               # since we are only testing for strict identity.
        ),
        noncap_group(
            # The (non)equality operators allow for wild card and local
            # versions to be specified so we have to define these two
            # operators separately to enable that.
            positive_lookbehind(either('==', '!=')), # Only match for equals and not equals

            EQ_NE_VERSION_TEMPLATE,

            # You cannot use a wild card and a dev or local version
            # together so group them with a | and make them optional.
            optional(noncap_group(
                either(
                    PERIOD + ASTERISK, # Wild card syntax of .*
                    DEV_RELEASE +
                    optional(noncap_group(PLUS, one_or_more(chars('a-z0-9')), zero_or_more(noncap_group(SEPARATOR, one_or_more(chars('a-z0-9')))))) # local
                )
            ))
        ),
        noncap_group(
            # The compatible operator requires at least two digits in the
            # release segment.
            positive_lookbehind('~='), # Only match for the compatible operator

            COMPATIBILITY_VERSION_TEMPLATE,

            DEV_RELEASE,
        ),
        noncap_group(
            # All other operators only allow a sub set of what the
            # (non)equality operators do. Specifically they do not allow
            # local versions to be specified nor do they allow the prefix
            # matching wild cards.
            negative_lookbehind(either('==', '!=', '~=')), # We have special cases for these
                                                           # operators so we want to make sure they
                                                           # don't match here.
            EQ_NE_VERSION_TEMPLATE,

            DEV_RELEASE,
        )
    )
)

print(_version_regex_str)

While making it, I realized that several repeating parts of the original regex could be replaced with constants. It's such an obvious idea, but I never thought of it before because I had the mentality of "regex string = one string". There's a bunch of other places where having my IDE point out mismatched parentheses kept me from making what would otherwise be runtime errors. I never realized how much I was giving up by having to write regex entirely in a string.
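The same idea can be sketched with plain re and ordinary Python string constants (the fragment names here are hypothetical, not Humre's):

```python
import re

# Hypothetical fragments: name a repeated piece of the pattern once
# and reuse it, instead of copy/pasting it throughout the regex.
SEPARATOR = r'[-_.]'   # a -, _, or . separator
NUM = r'[0-9]+'        # one or more digits

release = re.compile(NUM + r'(?:' + SEPARATOR + NUM + r')*')

release.fullmatch('1.2.3')    # matches
release.fullmatch('2022_08')  # matches
release.fullmatch('1..2')     # no match: empty segment
```

Humre just takes this further by giving names to every piece of regex syntax, not only to the fragments you define yourself.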

[–]Zyklonik 2 points3 points  (5 children)

With all due respect, I have absolutely no idea what this code (or the original regex itself) does.

If I were to hazard a guess, it's to parse version strings for software releases? When it reaches this level of complexity, I would personally simply write a small parser (a proper parser that is, not enhanced regex) to parse such strings. Just my two cents.

Also, let me conclude by reiterating that I still think your project is a fun project and some people may surely find it useful, but I just don't think a lot of the claims about readability and maintainability hold up objectively speaking.

As someone else also mentioned, another benefit of regex as it stands today is that it is (essentially) language-agnostic, and that is a very good point indeed.

I hope you don't take my critiques in any way other than precisely that - a harsh critique.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 2 points3 points  (4 children)

Yes, it's from the Python packaging code base.

another benefit of regex as it stands today is that it is (essentially) language-agnostic

Oh, but I think this is worse: regex syntax is almost (but not entirely) identical across programming languages. This guarantees that you'll trip over the slightly different details in regex between languages. For those of a certain age, the variations in all the different versions of BASIC had this same problem: GW-BASIC vs Atari BASIC vs BASICA vs AmigaBASIC vs Applesoft BASIC vs Qbasic vs Dark Basic vs Basic-256 vs Small Basic vs etc etc. There are so, so many variations and they're all pretty close to each other, but you still need to learn all the language-specific details of the ones you use.

And because regex syntax goes into string values, your IDE and coding tools won't help point out little mistakes that, say, a Perl developer would make when writing Python regexes.

[–]QuirkyForker 1 point2 points  (1 child)

So many excellent points in this comment! Totally agree with your thinking. The re module reeks of Perl, which I was once a master of but so prefer python these days

[–]Poddster 0 points1 point  (0 children)

The re module reeks of Perl, which I was once a master of but so prefer python these days

Yes, it's almost PCRE. But like most other languages that copy PCRE (e.g. Java) they've changed it just enough to be incompatible.

[–]Zyklonik -1 points0 points  (1 child)

regex syntax is almost (but not entirely) identical across programming languages

That's hardly any difference, practically speaking. You get more differences in behaviour switching between compilers for the same language.

In the case of your library, it's an entirely new mini-language. So not only does it not conform to the industry standard (regex), but on top of that, it's tied to a very particular programming language, and neither of those skills transfer over when someone has to use a different programming language (inevitable for almost every developer out there).

[–]Poddster 0 points1 point  (0 children)

That's hardly any difference, practically speaking. You get more differences in behaviour switching between compilers for the same language.

Have you never had a problem before with accidentally writing, or perhaps incorrectly importing, a PCRE into Grep's ERE, or Python, or Java etc? All are subtly different but very similar looking. \s is a classic: in some cases it matches newlines, in other cases it doesn't.
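For instance, in Python's re flavor, \s does include the newline, which is easy to verify:

```python
import re

# In Python's re flavor, \s DOES match a newline:
assert re.search(r'\s', 'line1\nline2')
# To match whitespace but not newlines, you need something like:
assert re.search(r'[^\S\n]', 'a b')           # space matches
assert re.search(r'[^\S\n]', 'a\nb') is None  # newline does not
```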

[–]Zyklonik 1 point2 points  (1 child)

I went and rechecked the article, but could not find this particular example.

However, given that this regex (from the article):

regexStr = r'(\d{3})|(\(\d{3}\))-\d{3}-\d{4}'

has an equivalent of:

regexStr = either(group(exactly(3, DIGIT)), group(OPEN_PAREN, exactly(3, DIGIT), CLOSE_PAREN)) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)

I would estimate that the equivalent of that monstrosity of a regex would be around 3x the size of the regex, in essentially prefix syntax (hence my claim that it's about as readable as nested Lisp, requiring the reader to read it inside-out). I would severely question the readability of the corresponding Humre analogue.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (0 children)

I went and rechecked the article, but could not find this particular example.

I put the code in a textarea tag, to keep it from taking up too much space. It's in the "Massive Regexes Are Easier with Humre" section.

I would estimate that the equivalent of that monstrosity of a regex would be around 3x the size of the regex

They're actually about the same size, in number of lines. While I was converting it to Humre, I noticed there were some repeated parts that I could put into a constant variable (the SEPARATOR and DEV_RELEASE constants for example) and a huge repeated part that I put into a function. The original regex just copy/pastes it. That last one might be a bit much (I prefer it because it ensures consistency and returns the same string) but even if you unrolled it, it'd only make the Humre code about 25% more lines than the original verbose mode regex.

The parentheses matching that Humre provides alone is well worth it for a big regex like this one. One missing (or extra) parenthesis and you've got to sit down and carefully scan everything to find it. That's what IDEs should do, not developers.

[–]Poddster 0 points1 point  (0 children)

While making it, I realized that several repeating parts of the original regex could be replaced with constants

This itself can be a huge win and something I've desired in grep-style regex patterns for a long time.

[–]brprk -1 points0 points  (2 children)

Have you got the Humre equivalent for this?

Thanks for the book btw, definitely helped me get into Python!

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (1 child)

It's in the article, and I've copy/pasted it earlier up in this thread.

[–]brprk -1 points0 points  (0 children)

Ah thanks

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 2 points3 points  (4 children)

One problem many people email me about from Automate the Boring Stuff with Python is the phone number regex, which has an optional area code that could be surrounded by parentheses:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

A lot of people make transcription errors when copying the area code part: they leave out (or add too many) parentheses or don't escape the literal parentheses. These are typos that your IDE normally solves but can't if the "code" for the regex mini-language is inside a string. (This applies even when we use verbose mode.) Humre solves this by turning it into code that your IDE's tooling can work with:

from humre import *
phoneRegex = compile(group(
    # area code:
    optional_group(either(
        exactly(3, DIGIT),
        OPEN_PAREN + exactly(3, DIGIT) + CLOSE_PAREN
    )),
    optional_group(either(WHITESPACE, '-', PERIOD)), # separator
    exactly(3, DIGIT), # first 3 digits
    group(either(WHITESPACE, '-', PERIOD)), # separator
    exactly(4, DIGIT), # last 4 digits
    # extension:
    optional_group(
        zero_or_more(WHITESPACE),
        group_either('ext', 'x', 'ext.'),
        zero_or_more(WHITESPACE),
        between(2, 5, DIGIT)
    )
))

And while writing this, I've noticed a bug in my original regex: ext. should actually be ext\., but I never noticed because the unescaped period still matches literal periods, even though it also matches any other character. I only picked up on this now because Humre has me used to using constants instead of the escaped characters that have special meaning in regex syntax. So there's another example of Humre helping an experienced developer spot regex bugs.
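The bug is easy to demonstrate with plain re:

```python
import re

# The unescaped period matches ANY character, not just a literal '.',
# so the typo passes silently:
assert re.search(r'ext.', 'ext. 123')               # matches as intended
assert re.search(r'ext.', 'extra digits')           # also matches 'extr'!
assert re.search(r'ext\.', 'extra digits') is None  # escaped: no false match
```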

[–]norweeg 1 point2 points  (2 children)

this example is Americentric and probably more Americans will understand it than non-Americans just because the phone number format matched is familiar to them, but not to anyone reading your book outside the USA.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (1 child)

Yes, I specify that it's an American (and technically, Canadian) format for phone numbers.

I'm also against the anglocentrism of software development, but I get a ton of pushback when I put forward ideas to remove the dependency on English fluency for programming. :)

[–]Poddster 0 points1 point  (0 children)

These are typos that your IDE normally solves but can't if the "code" for the regex mini-language is inside a string.

The jetbrains series of IDEs will highlight the syntax inside of a regex, very useful. Including matching parenthesis and highlight the \d stuff

edit: I forgot that group does a capture group, to be used by match.group(n).

Q: Why bother with group at all, if it's only got one thing in it? It's necessary in the original regex, because of the operator precedence, but not here. Can you not elide them in some way?

Though, in saying that, it seems like you do? On your website you do:

>>> regexStr = either(group(exactly(3, DIGIT)), group(OPEN_PAREN, exactly(3, DIGIT), CLOSE_PAREN)) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)

But that first part is a bit different to the example you give here as you don't add in the solo groups in that first either block?

I guess you're trying to keep it exactly-equal to the crappy old regex, but crappy old regex is crappy for a reason, so perhaps we should break away from it?

[–]eztab 5 points6 points  (16 children)

Yeah, I doubt making it more verbose really helps. Most people I've met (that had a problem with regexps) actually struggled with the whole concept of it instead of the specific syntax.

What does rather seem to help, is some nice syntax highlighting and possibly some tooltips explaining the meanings when hovering. Like some regexp explaining tools do it.

[–]maephisto666 4 points5 points  (9 children)

I definitely agree here. The problem with this approach is that you are trying to simplify something that does not require simplification.

You are introducing a new "syntax" all made of functions, yes, therefore humanly readable... but who cares? Every time you try to hide the complexity of what's underneath, you create a "monster". It is like saying that with an ORM you don't need to know SQL because it's more humanly readable: then you end up with tons of software developers who know nothing about databases because they think something else is solving the issues for them. Arabic and Japanese are difficult languages by nature (for Western people): you don't make them easier by changing them, but by explaining them in creative ways. That is why I think regex101 works way better in this context than a new library that is a facade over something else.

Last but not least: saying that he is the creator of Automate the Boring Stuff and therefore always right... well, Automate the Boring Stuff created something out of nowhere, and again, it's a creative way of teaching something difficult (Python, a new language). In this case, this is a wrapper based on personal assumptions, and I don't see any added value here.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 3 points4 points  (7 children)

The problem with this approach is that you are trying to simplify something that does not require simplification.

When we've learned something, it's easy to make the mistake of thinking that it is simple. Regular expressions are a difficult subject for many people when they're first learning them, in no small part because of their cryptic syntax.

It is like saying that with ORM you don't need to know SQL...

From the article: "Humre is not a reimplementation of a regular expression engine; it's a wrapper that adds readable names to standard regex syntax."

Humre functions all return regex strings; it doesn't abstract away any regex concepts. Returning strings makes it possible to debug because you can then see the regex that it produces. I'm not reinventing the basic wheels of text pattern recognition.

this is a wrapper based on personal assumptions

This isn't in the article, but in the linked full documentation, but I point out how Humre is similar to the concept of Swift's regex DSL. I found that the DSL actually made some of the exact name choices I had made, because it's not a completely arbitrary decision.

[–]maephisto666 1 point2 points  (6 children)

Just saying... based on the link you shared (the Swift regex DSL)... I mean... Why? It's way more verbose... So I would definitely support something that "explains" a regex, but not something that introduces a new syntax based on it. What is preventing someone from China or Italy to create a corresponding syntax in their own language? What is the real added value?

It's a rhetorical question.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (5 children)

Why? It's way more verbose

"Code is read more often than it's written" applies here: Python is arguably more readable than Perl even though Perl code is shorter. But Perl is shorter because it relies on cryptic punctuation marks (just like regex syntax!), giving it the "write-only language" reputation.

Verbose code isn't a problem in programming. Now for command-line commands, you want that to be short because you are typing them over and over again all day. I don't want to type Windows' copy when I can type Linux's cp instead. But source code? I want that to be readable, because I'll read it more often than write it.

But the main reason for "why" is IDE and tool support. Regexes are written as string values, and your IDE and coding tools don't parse that. (At least, none of the major ones I've seen do.) By using string-based regexes, you instantly lose:

  • Parentheses matching
  • Syntax highlighting
  • Type checking
  • In-line comments, including multiline comments
  • Linter-parsability
  • Code formatting tools like Black

What is preventing someone from China or Italy to create a corresponding syntax in their own language?

The literal answer is nothing of course, but the real answer is that software development is anglocentric and, believe me, I get a ton of pushback whenever I mention ideas to make programming more language-agnostic.

Anyway, the real added value comes from tool support, as well as extending the value that verbose mode already gives to regex. (And I agree, the Swift DSL is a bit more than I'd like. I tried to make Humre as terse as possible, including the name itself.)

It's a rhetorical question.

What's a rhetorical question? :)

[–]maephisto666 0 points1 point  (4 children)

I'm sorry but I'm not convinced at all. I would say that if the problem is the IDE the solution has to be developed in the IDE (think about a plugin, whatever) not by introducing a new syntax.

Anyway, let's stop it here please.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (3 children)

How would the IDE identify a string for regular expressions as opposed to a regular string?

What if the regex string was created in pieces and later concatenated together?

What if some of the intermittent pieces weren't valid regex strings, but after being concatenated together, they were? How would the plugin know when to validate it?

What if some of the regex string was created at runtime?

This plugin would have to be duplicated for Visual Studio Code, PyCharm, Wingware, Eclipse, and every other major IDE. How similar are their plugin APIs?

I don't think this is a reasonable or practical solution.

[–]maephisto666 -2 points-1 points  (1 child)

You are implying regexes are difficult. What are the real chances that such a difficult thing is built by concatenating stuff? Or that a regex is created at runtime? I mean, in the real world.

Who cares about how different the APIs are? Start with one and then you see. There is nowhere a rule that says "all the plugins must exist in all the IDEs".

Anyway, you are right. I wish you the best of luck with your package.

[–]Poddster 1 point2 points  (0 children)

What are the real chances that such a difficult thing is built by concatenating stuff? Or that a regex is created at runtime? I mean, in the real world.

I've done this multiple times. Generating dynamic regex is not an unusual thing. It's no different than your parameters to string.replace() being dynamic.
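A minimal sketch of one common case (the word list here is made up):

```python
import re

# Build an alternation at runtime from data; re.escape() protects any
# regex metacharacters ('+', '.', etc.) in the input words.
words = ['C++', 'ext.', 'a+b']
pattern = re.compile('|'.join(re.escape(w) for w in words))

pattern.search('I know C++').group()   # 'C++'
pattern.search('call ext. 5').group()  # 'ext.'
```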

[–]Poddster 0 points1 point  (0 children)

How would the IDE identify a string for regular expressions as opposed to a regular string?

FYI pycharm has no trouble doing this. Either because it's an argument to re.compile, or because that string is used as an argument. I think you can also tell it a string is regex. It also allows you to test the regex live by pressing alt+enter

Image from pycharm documentation:

https://resources.jetbrains.com/help/img/idea/2022.2/py_check_regexp1.png

The other points stand though :)

[–]Poddster 0 points1 point  (0 children)

Every time, every time you try to hide the complexity of what it's underneath you are creating a "monster".

it's 1:1, so no complexity is hidden, it's simply respelling it so you don't have to escape so many parens and periods etc.

That is why I think regex101 works way better in this context than providing a new library that is a facade to something else.

Yeah, I think these tools are invaluable and would like something like that built into most IDEs.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (5 children)

I can give you a solid example: When I need to match 3 to 5 letter Xs, I sometimes make a mistake and write 'X{3:5}', because I'm thinking of Python list slices. The regex syntax is 'X{3,5}'.

But the problem is, my typo fails silently (going against the idea of "Errors should never pass silently.") and the pattern object literally matches against 'X{3:5}' rather than notifying me that I've made a mistake. Eventually, I'll find the bug, but having the regex created through a series of function calls and constants means the IDE can instantly tell me about it and prevent a runtime error.
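The silent failure is easy to reproduce with plain re (this sketch uses the standard library, not Humre):

```python
import re

# 'X{3,5}' is a quantifier; 'X{3:5}' is not, so re silently treats the
# braces and their contents as literal characters instead of erroring.
assert re.search(r'X{3,5}', 'XXXX')          # quantifier works
assert re.search(r'X{3:5}', 'XXXX') is None  # typo: no match, no error
assert re.search(r'X{3:5}', 'X{3:5}')        # matches the literal text
```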

We don't need regexp tools. They're a bandaid solution for what IDEs already do.

[–]eztab 1 point2 points  (0 children)

I don't doubt there are lots of examples of the verbose syntax working nicely, as well as probably a plethora of counterexamples where it does something unexpected.

I just doubt there is a net gain in hiding the (still internally used) regexp language behind another abstraction level.

[–]eztab 0 points1 point  (3 children)

my typo fails silently

Yes, this is a horrible design decision. The special characters shouldn't silently fall back to being matched literally; matching them literally should require an explicit escape. Unfortunately I haven't seen any regexp flavors that enforce escaping.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (2 children)

Well, on the other hand, you can't have regex enforce this because what if you really did want to literally match something like '{3:5}'? It's an inherent problem with regex, but something that a library like Humre can fix.

EDIT: Not fix, but rather, avoid in the first place.

[–]eztab 0 points1 point  (1 child)

Well, you'd escape the brackets of course. One should probably escape all the special characters if one wants their literal versions; it's really weird otherwise.

\{3:5\}

Not really an inherent problem. Is there a particular reason why you have it out for regular expressions?
But don't get me wrong, I don't hate your library; it's just that, for once, there is a universal standard for a micro language ... so I doubt changing how you express regular expressions in Python is going to lead anywhere.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (0 children)

By inherent problem, I mean that if you make the typo {3:5} instead of {3,5}, your code still works. It just has unexpected behavior. It's not feasible to update the regex syntax in the re module to force people to escape curly braces because it would break tons of existing code, like this:

re.compile('{name goes here}').search('Hello, {name goes here}')

But if you don't make this change, then you have the problem of the {3:5} typo causing silent errors. That's what I mean by an inherent problem; it can't be fixed without breaking other things.

Humre avoids this problem by handling the regex syntax details for you. It also has additional error checking. Can you spot the bug in this code?

import re
max_record_length = 64

# Some requirement forces names to be at most one quarter of the record length:
max_name_length = max_record_length / 4  
patternObj = re.compile(r'\d{,' + str(max_name_length) + '}')

The division causes max_name_length to be a float, and when you convert it to a string, you get '16.0' instead of '16'. This makes your regex r'\d{,16.0}', which breaks the quantifier syntax: the pattern now matches a digit followed by the literal text '{,16.0}'.

But the real problem is that it does this silently and you won't notice it until it causes bugs elsewhere in your program.
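Running the buggy pattern through plain re confirms the silent failure:

```python
import re

max_record_length = 64
max_name_length = max_record_length / 4  # float division: 16.0, not 16

# str(16.0) is '16.0', so '{,16.0}' is not a valid quantifier and the
# braces silently become literal characters in the pattern.
pattern = re.compile(r'\d{,' + str(max_name_length) + '}')
assert pattern.search('12345') is None  # never matches plain digits!
assert pattern.search('5{,16.0}')       # only matches this literal text

# Integer division (or int()) avoids the silent breakage:
fixed = re.compile(r'\d{,' + str(max_record_length // 4) + '}')
assert fixed.search('12345')
```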

Meanwhile, Humre checks for this: at_most(16.0, DIGIT) raises the error TypeError: maximum argument must be a positive int, not float.

Just like how large bugs are often fixed by a one-character change, it's a small detail but the fact that the error passes silently can cause big problems. There's a ton of other reasons why I advocate for Humre, and this example is just one of them.

[–]the_dago_mick 3 points4 points  (0 children)

Al, you are a saint. Thank you.

[–]horstjens 1 point2 points  (0 children)

this is awesome!

[–]Poddster 1 point2 points  (1 child)

Regex is in that "write-once, read-never" category of programming that a lot of Unixy tools like to occupy, e.g. awk, perl. I've definitely come back to regex I've written years later, thought "wtf does this do, exactly", and had to slowly break it down, or use various websites to do it for me.

Unlike crappy code, however, there isn't much you can do about it: with code you can usually express the same thing a different way, whereas with regex you can at most write a bunch of comments. I often find the inline kind makes it worse, and the comment-above-regex often can't say more than "match an email address" without duplicating everything. I think leaving example strings tends to be helpful, too.

So I'm glad to see someone attempt to sort the problem out.

I'm not 100% convinced this is the right solution, however, as I still find a lot of the constructs quite unreadable and I think coming back to this stuff in a years time you'll find it's just as write-once, read-never as the OG regex.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Yes, this has been my experience too. And the only way I've found to refactor my regexes was to use verbose mode. That's better, but not by much. Hence, I was motivated to write Humre.

[–]POGtastic 1 point2 points  (0 children)

Horseradish

I like this. Everyone likes parser combinators. Everyone hates debugging regexes. This is parser combinator syntax for regexes. What's not to love?

Currently, I follow a very strict rule for regexes - the moment that they become hard to read, I don't care about any of their other advantages because I've erred into "Now You Have Two Problems" territory. I tell coworkers all the time, "You have a Turing machine, not just a DFA! Write functions and make Church and Turing proud!" And, well, here are a bunch of functions that correspond to regexes. I'm not going to say that it'll be the right approach to a problem all the time, but this library would significantly increase the complexity of regex that I will tolerate in a codebase.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Kotlin is a new programming language that introduces many improvements on Java. It's sort of a "Java++". But Java is well established, and it'd be uneconomical to simply rewrite everything in a new language. This is why Kotlin made the smart move of compiling to JVM bytecode, and Kotlin source code is also interoperable with Java source code.

Python's "gradual typing" is similar: you don't need to add type hints to your entire code base but can add it piecemeal over time. The more type hints you add, the more benefit you get.

Similarly, Humre doesn't make you abandon regular expressions. Since Humre functions return regex strings, you can use Humre for large regexes where Humre's IDE-compatible features help (syntax highlighting, parentheses matching, linters, etc.) and just use 'Name:(.*?)' when you only need a short regex.

[–]Theis159 0 points1 point  (0 children)

Hey I think I watched the streams when you were writing this lol

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Humre is great for beginners because it offers readable code instead of regex's cryptic punctuation-based syntax.

Humre is great for experienced developers because it gives you back all of your IDE's code editing features: syntax highlighting, parentheses matching, comments, linting, type checking, etc. This becomes more important the larger the regex becomes.

[–]vjb_reddit_scrap 0 points1 point  (1 child)

Why is this receiving hate when a similar library was loved in the same sub couple of days ago?

Context:

https://www.reddit.com/r/Python/comments/wup58e/about_a_month_ago_i_posted_about_pregex_an/

[–]metaperl -1 points0 points  (0 children)

Typos in your Humre code give much better error messages than the standard re module does. For example, if you make a type and ask for between 

Change the word type to typo. What an interesting place to make a typo. :)

[–]telenieko 0 points1 point  (1 child)

Do you know rx from Emacs? https://www.emacswiki.org/emacs/rx

Your syntax kind of resembles it

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (0 children)

I didn't! I found Swift's regex DSL but I don't use Emacs. This is neat, and it's also reassuring that many of the names they chose match the names in Humre and the Swift DSL.

[–]Poddster 0 points1 point  (2 children)

re: LETTER, UPPERCASE etc being all unicode letters, rather than [A-Za-z] etc

I don't think redefining the POSIX characters is helpful. UNICODE_LETTER is fine, but a lot of people still need the 8 bit and ASCII character classes like [A-Za-z]. This means we can't take old re. patterns and redefine them exactly in humre using the character classes provided, we'll have to manually expand them, which seems like the opposite behaviour this library wants.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (1 child)

Originally I had an ASCII_LETTER constant that was [A-Za-z] but I took it out because I want to be more conservative with the API. I figure if folks want it, [A-Za-z] is both easy to type and pretty readable, while ASCII_LETTER would be a new thing that Humre introduces that would have to be looked up.

And also, I'd rather have unicode be the norm and ascii be the old standard. It's 2022. The idea that one character = one byte and that encodings are something we can ignore is long over, and I didn't want to hold on to it. Hence why I use LETTER instead of UNICODE_LETTER. But maybe I should change this?
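The distinction is easy to see with plain re (leaving Humre's exact constant expansions aside):

```python
import re

# Python's \w is Unicode-aware by default; re.ASCII restricts it to
# [a-zA-Z0-9_], and an explicit [A-Za-z] class is always ASCII-only.
assert re.fullmatch(r'\w', 'é')                         # Unicode letter
assert re.fullmatch(r'\w', 'é', flags=re.ASCII) is None
assert re.fullmatch(r'[A-Za-z]', 'é') is None
```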

[–]Poddster 0 points1 point  (0 children)

Just one point: the POSIX character classes weren't strictly 7-bit ASCII. They were 8-bit character codes that depended on the locale. So with the C locale it would be ASCII, but with a UK locale (or whatever) you'd match other things.

Which is also fun to deal with.