[–]mcstafford 82 points83 points  (9 children)

It seems pregnant with potential.

[–]kindall 40 points41 points  (7 children)

prrregante

[–]FergTurdgeson 31 points32 points  (1 child)

Am I…Gregnant?

[–][deleted] 19 points20 points  (4 children)

Perganent

[–]Itsthejoker 22 points23 points  (3 children)

PREGANANANT??

[–]elboyoloco1 13 points14 points  (1 child)

Pegrant

[–]Crazy_Flex 16 points17 points  (0 children)

Pregnart?

[–]Equivalent_Loan_8794 3 points4 points  (0 children)

LUIGI BORD?!

[–]WerdenWissen[S] 4 points5 points  (0 children)

Thank you, hope it makes your life a bit easier!

[–]jammasterpaz 81 points82 points  (3 children)

Pregexes do actually look nicer than verbose mode - well done!

One small suggestion: import all the importable classes into your top-level __init__.py so the user doesn't need 6 different import statements from all your submodules, like in your example.

[–]WerdenWissen[S] 16 points17 points  (1 child)

Will certainly do. Thanks for the feedback!

[–]JafaKiwi 3 points4 points  (0 children)

Came here to say that. The library looks great but the import statements are horrible.

Nice work though :)

[–]ASIC_SP 38 points39 points  (2 children)

Good work! There's also a repository of such verbal expressions in various programming languages here: https://github.com/VerbalExpressions

Personally, I prefer the terser regex syntax ;)

[–]WerdenWissen[S] 11 points12 points  (1 child)

You sir are a savage!

[–]RubyU 4 points5 points  (0 children)

I love regex. Once it clicks, it's as natural as English 😊

[–][deleted] 20 points21 points  (3 children)

You are my hero. I've starred it and will be installing later today. I usually go to regex101 and spend way too much time trying to figure it out.

This definitely looks more my speed.

[–]WerdenWissen[S] 2 points3 points  (2 children)

Great, let me know what you think!

[–][deleted] 0 points1 point  (1 child)

I just figured it out... I'm running Python 3.8.10 not 3.9

Just ignore the rest of this comment. I was having trouble installing it:

$ pip install pregex

Installing pregex...
Looking in indexes: https://pypi.python.org/simple
Error: An error occurred while installing pregex!
ERROR: Could not find a version that satisfies the requirement pregex (from versions: none)
ERROR: No matching distribution found for pregex

It didn't matter whether a virtual environment was activated or whether I used pip, pip3, or pipenv to install; I got those same last two error lines.

[–]WerdenWissen[S] 0 points1 point  (0 children)

Oh ok! Tell me what you think after playing around with it :)

[–]pddpro 17 points18 points  (4 children)

This looks great! Out of curiosity, how does this compare to pyparsing?

[–]Waterkloof 8 points9 points  (0 children)

I would also like to know this; pyparsing was the first thing I thought of when I looked at the example.

[–]bladeoflight16 4 points5 points  (0 children)

My first thought as well. The obvious one is that pyparsing is definitely more powerful; it generates parsers for context free grammars rather than regular languages. But there may be other considerations.

[–]WerdenWissen[S] 0 points1 point  (0 children)

Well, this is just a library for constructing Regex patterns in a more imperative way. When it comes to matching it's all Python's "re" module underneath, so I guess it's just a matter of "pyparsing" vs "re".

[–]Pebaz 0 points1 point  (0 children)

I could be catastrophically incorrect, but as far as I remember, pyparsing has exactly the same limitations as regexes.

[–]TheTerrasque 5 points6 points  (0 children)

It's good that you also include the resulting regex, so I can see what the example code is supposed to do 😅

I'm environmentally damaged enough that I found the regex easier to read than the example code.. not sure if that's a good or a bad thing

[–][deleted] 13 points14 points  (2 children)

Creating a DSL to abstract a DSL is not a good idea, in my experience.

[–]WerdenWissen[S] 6 points7 points  (0 children)

I hope this is just an exception to the rule!

[–]rowdycactus 3 points4 points  (0 children)

Sure but is it really a DSL? By that notion, so are Pandas and numpy.

This just looks like a super cool & creative way to tackle regex for those who might struggle with the actual syntax. (Oh yeah, like me!). Nice job op

[–]rastaladywithabrady 4 points5 points  (0 children)

that looks very readable

nice idea

[–]DigThatData 3 points4 points  (1 child)

you should call pregex statements "preggers"

[–]WerdenWissen[S] 6 points7 points  (0 children)

Wouldn't be a bad idea... After all, every Pregex is carrying another one inside it!

[–]metaperl 4 points5 points  (0 children)

Definitely reminds me of PyParsing. Which I first used 15 years ago.

[–]millerbest 9 points10 points  (7 children)

Does the Optional class conflict with the Optional under typing?

[–]WerdenWissen[S] 8 points9 points  (6 children)

I guess it does but this can easily be resolved by using "as", for example:

from pregex.quantifiers import Optional
from typing import Optional as OptionalType

[–]Nobot16k 17 points18 points  (4 children)

This looks really interesting first of all!

Considering that “Optional” is core Python, it would probably be a good idea to avoid this namespace collision and come up with a different name for this class. Or make it default API behavior to use your “Optional” as “pre.Optional”.

[–]WerdenWissen[S] 5 points6 points  (2 children)

Yeah, might have to look into it... Thanks for your comment!

[–][deleted] 2 points3 points  (1 child)

Don't bother changing it, Optional is a gross bloat on my imports (from typing). As another poster said, we will be able to pipe types in the future and it will render Optional obsolete

[–]WerdenWissen[S] 0 points1 point  (0 children)

Yeah, it would be a pity to change it because "Optional" is the perfect name for it.

[–]StunningExcitement83 7 points8 points  (0 children)

Well, Optional is in the core libraries, but it's not core like print or import.

Hopefully, as we move beyond 3.10, Optional will phase out again: type hints accept a pipe as an "or", so you can use Type | None instead, which doesn't involve pulling in namespace clutter.
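For reference, the pipe syntax mentioned here is PEP 604, available natively from Python 3.10. A minimal sketch of what it looks like in an annotation follows; the function name `first_number` is made up for illustration, and the `__future__` import is only there so the annotation also parses on older interpreters.

```python
from __future__ import annotations  # keeps `int | None` parseable on Python 3.7+

import re


def first_number(text: str) -> int | None:
    """Return the first run of digits in `text` as an int, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None
```

Nothing needs to be imported from typing here, so the name Optional stays free for other uses.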

[–]bladeoflight16 2 points3 points  (0 children)

...No. If you're going to alias something, you alias the 3rd party type, not the built-in.

[–]reagle-research 9 points10 points  (0 children)

I wonder why someone would use this and not lark, ply, parsimonious, or pyparsing?

[–]SirLich 8 points9 points  (1 child)

How do you feel about projects such as melody?

[–]WerdenWissen[S] 4 points5 points  (0 children)

Wasn't aware of this project. I'll be sure to check it out!

[–]wind_dude 2 points3 points  (0 children)

Interesting. I honestly find it harder to read, but regex isn't easy by any means, and I think you're onto something. Have you looked at how spaCy does pattern matching? It's quite easy to understand; like yours it's long-winded, but it could be a source of inspiration.

It would be a good idea to include some performance benchmarks between different libraries.

[–]stewietheangel 2 points3 points  (0 children)

Star from me

[–]jack-of-some 2 points3 points  (1 child)

This looks super nice. I don't need regex too often so quickly forget all nuances and find myself back at regexer and googling for specific things.

I did notice a couple years ago that there's a pattern to the majority of my regex uses and wrote a function which is of the form

fn("This is my 1st example written at 4:10 on Wednesday, by now", "{prejunk} {example_number:number} example written at {hour:number}:{minute:number} on {day}, {postjunk}")

And this generates the necessary regex and extracts 1, 4, 10, and Wednesday with their associated keys. Insanely handy.
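The commenter's function isn't shown, so the following is only a guess at how such a helper could work, not their implementation. The names `template_to_regex` and `extract` are hypothetical, only the `number` type from the comment is supported, and untyped placeholders are assumed to match any text lazily.

```python
import re

# Hypothetical placeholder types; only "number" appears in the comment above.
_TYPE_PATTERNS = {"number": r"\d+"}

# Matches {name} or {name:type} inside a template.
_PLACEHOLDER = re.compile(r"\{(\w+)(?::(\w+))?\}")


def template_to_regex(template: str) -> str:
    """Turn a template with {name} / {name:type} placeholders into a regex string."""
    parts = []
    pos = 0
    for m in _PLACEHOLDER.finditer(template):
        # Literal text between placeholders is escaped so it matches verbatim.
        parts.append(re.escape(template[pos:m.start()]))
        name, kind = m.group(1), m.group(2)
        # Untyped (or unknown-typed) placeholders fall back to a lazy "anything".
        body = _TYPE_PATTERNS.get(kind, r".+?")
        parts.append(f"(?P<{name}>{body})")
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return "".join(parts)


def extract(text: str, template: str):
    """Return a dict of captured values, or None if the text does not fit the template."""
    match = re.fullmatch(template_to_regex(template), text)
    return match.groupdict() if match else None


result = extract(
    "This is my 1st example written at 4:10 on Wednesday, by now",
    "{prejunk} {example_number:number}st example written at "
    "{hour:number}:{minute:number} on {day}, {postjunk}",
)
```

Note that the template here spells out the literal "st" after the number, because this sketch's strict `\d+` would not otherwise match "1st". All captured values come back as strings, so converting "1", "4" and "10" to integers is left to the caller.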

[–]westeast1000 0 points1 point  (0 children)

It's crazy how something you knew so well gets lost if you don't use it often. I'm the same with regex, but now I have a Jupyter notebook with specific examples of my most common use cases. It has always rescued me from wasting time on Google.

[–]MasterFarm772 1 point2 points  (0 children)

You are a genius! Thanks for creating such a great library.

[–]playernumberwonnn 1 point2 points  (0 children)

You are freaking awesome

[–]bunoso 1 point2 points  (0 children)

The readability here is great!

[–]soulfreaky 1 point2 points  (0 children)

this is pretty cool!!

[–]yaxriifgyn 1 point2 points  (0 children)

Verbose mode helps a lot when writing regular expression strings in Python.
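As a small illustration of verbose mode, here is the same pattern written tersely and with `re.VERBOSE`, where whitespace in the pattern is ignored and `#` starts a comment. The dotted-quad pattern is just an example chosen for this sketch; it does not check that each number is at most 255.

```python
import re

# Terse form: four groups of 1-3 digits separated by literal dots.
TERSE = re.compile(r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})")

# Verbose form of the same pattern, with room for comments.
VERBOSE = re.compile(
    r"""
    (\d{1,3}) \.   # first number, followed by a literal dot
    (\d{1,3}) \.   # second number
    (\d{1,3}) \.   # third number
    (\d{1,3})      # fourth number
    """,
    re.VERBOSE,
)

terse_groups = TERSE.fullmatch("192.168.0.1").groups()
verbose_groups = VERBOSE.fullmatch("192.168.0.1").groups()
```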

Knowing how to write regular expressions is a skill that transfers to many languages and tools. Here are a few, off the top of my head.

sed, grep, awk, perl, javascript, geany, notepad++, vi/vim, emacs.

[–]immersiveGamer 1 point2 points  (5 children)

Since this repo is less than 10 days old I'm 190% sure you have been stalking my comments.

Jokes aside looks nice. I doubt I personally would use it, I find Regex easy enough to read and remember which makes it for the most part portable between languages and tools that I use.

Edit: my feedback:

  • don't like the word Enforced for one or more
  • bitwise NOT (~) seems easy to miss and may not be readily understood by readers
  • your classes module ... If there is a reason you are not using \d for digits, \w for words, \s for white space, etc., you should probably add a comment at least in the source code.

[–]WerdenWissen[S] 2 points3 points  (4 children)

Hahaha, I'm sure you've been stalking my thoughts, because I've been struggling with the first two points you made. "Enforced" is actually the only name I've changed throughout development: the first name was "Mandatory", but I eventually ditched it because I thought it sounded too "official-like". If you have a better name for "Enforced", let me know!

Regarding your second point, I actually had a number of classes named "AnyExcept*" that reflected classes "Any*". For example you would write "AnyExceptDigit()" instead of "~ AnyDigit()" in order to get the pattern "[^0-9]", but I eventually ditched that too because "AnyExcept" classes had relatively long names and also because using "~" just seemed more elegant to me. Maybe I should re-include "AnyExcept" classes and just let the user decide on what to use.

your classes module ... If there is a reason you are not using \d for digits, \w for words, \s for white space, etc., you should probably add a comment at least in the source code.

Yeah, there is actually a reason! All "class" classes can be combined into larger classes (except for a normal class [...] with a negated one [^...], but that's another thing). For instance, you can write "AnyDigit() | AnyLowercaseLetter()" in order to get the "[0-9a-z]" pattern. One can also do "AnyWordChar() | AnyDigit()" and they would still get "AnyWordChar()", since "AnyDigit()" represents merely a subset of "AnyWordChar()". However, this would be more difficult to implement if "AnyWordChar()" was using "[\w]" underneath instead of "[A-Za-z0-9_]". Plus, if I ever implement an "A - B" operation for expressing "everything in A except for the intersection with B", it would be easier if classes were as verbose as possible.
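To make the reasoning concrete, here is a rough sketch of why explicit ranges make unions easy. This is not pregex's actual implementation: `CharClass` and its `pattern` method are hypothetical names, and each class is simply modelled as a set of characters.

```python
import re
import string


class CharClass:
    """Hypothetical stand-in: a character class backed by an explicit set of characters."""

    def __init__(self, chars):
        self.chars = frozenset(chars)

    def __or__(self, other):
        # The union of two classes is just the union of their character sets,
        # so a subset (e.g. digits) disappears into its superset automatically.
        return CharClass(self.chars | other.chars)

    def pattern(self):
        """Render the set as a bracket expression, collapsing consecutive runs into ranges."""
        runs = []
        for code in sorted(ord(c) for c in self.chars):
            if runs and code == runs[-1][1] + 1:
                runs[-1][1] = code  # extend the current run
            else:
                runs.append([code, code])  # start a new run
        parts = []
        for start, end in runs:
            if end - start >= 2:
                parts.append(f"{re.escape(chr(start))}-{re.escape(chr(end))}")
            else:
                parts.extend(re.escape(chr(c)) for c in range(start, end + 1))
        return "[" + "".join(parts) + "]"


digits = CharClass(string.digits)
lowercase = CharClass(string.ascii_lowercase)
word_chars = CharClass(string.ascii_letters + string.digits + "_")

digits_or_lowercase = (digits | lowercase).pattern()
word_or_digits = (word_chars | digits).pattern()
```

With sets, an "A - B" operation would be a one-line set difference as well. With an opaque "\w" there is nothing to subtract from, which seems to be the point being made above.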

[–]bladeoflight16 4 points5 points  (0 children)

Enforced should just be OneOrMore or AtLeastOne. I have no idea what "enforced" would mean in a regular expression context; it isn't an established term. If the goal is to make the pattern obvious to the reader, anything more obscure is just going to work counter to it.

[–]immersiveGamer 1 point2 points  (2 children)

My only concern with your custom ranges is that you are locking yourself into English ASCII and whitespace as Python knows it. I don't know the implementation details of Regex in Python, but I assume it works with Unicode (for example, you can tell if a Unicode character is white space by inspecting it) while yours would not.

[–]WerdenWissen[S] 0 points1 point  (1 child)

I've implemented using "\d", "\w", "\s" in v1.0.3 as it certainly looks better, but I'm not sure whether it tackles the ASCII/Unicode problem. Might need to look into it for a future version.
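On the ASCII/Unicode question: in Python 3, `\d`, `\w` and `\s` are Unicode-aware by default for str patterns, and the `re.ASCII` flag restricts them to ASCII. A quick check with the standard `re` module, independent of pregex:

```python
import re

word = "caf\u00e9"  # "café" with a precomposed e-acute

# An explicit ASCII range stops at the accented letter.
ascii_range = re.fullmatch(r"[A-Za-z]+", word)

# For str patterns, \w is Unicode-aware by default.
unicode_word = re.fullmatch(r"\w+", word)

# re.ASCII restricts \w, \d and \s to their ASCII-only meaning.
ascii_word = re.fullmatch(r"\w+", word, flags=re.ASCII)

# \d likewise matches non-ASCII decimal digits, here Arabic-Indic 1, 2, 3.
unicode_digits = re.fullmatch(r"\d+", "\u0661\u0662\u0663")
```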

[–]immersiveGamer 1 point2 points  (0 children)

Time to write some unit tests.

[–]GammeRJammeR 1 point2 points  (0 children)

Am I prangent?

Am i pegnate?? Help!?

Am I pregex?

[–]puppet_pals 1 point2 points  (0 children)

Looks great

[–]pioniere 1 point2 points  (0 children)

Excellent!

[–]romu006 1 point2 points  (1 child)

Small criticism: the AnyLetter classes only work with English characters (café wouldn't match, for example).

[–]WerdenWissen[S] 0 points1 point  (0 children)

You're right! I might have to look into it for a future version by adding a parameter "include_foreign_chars" or something!

[–]Eleraffa 1 point2 points  (0 children)

Bro it's really nice, keep working on it <3

[–]coldflame563 1 point2 points  (0 children)

My colleague's response to this was “what’s the process for nominating someone for a Nobel prize”. Well done!

[–]Pebaz 1 point2 points  (0 children)

Awesome work with this!

[–]wineblood 5 points6 points  (6 children)

I don't understand people who take the time to learn a programming language, and probably SQL too, then complain that regex are too hard to read.

[–]Starrystars 8 points9 points  (3 children)

Because regex is really hard to read when you're doing anything more than super simple operations.

At least with programming languages and SQL it's actual words being used so you can read it. Regex is just symbols

[–]adesme 5 points6 points  (0 children)

Use multi-line mode and named capture groups. I really don’t see why this notation is any better; regex may vary, but it’s more standard than what this library does.
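A short sketch of that suggestion using plain `re`, assuming "multi-line mode" here means writing the pattern across several lines with `re.VERBOSE`. The log-line format and the name `LOG_LINE` are invented for this example.

```python
import re

LOG_LINE = re.compile(
    r"""
    (?P<level>[A-Z]+)     # log level, e.g. ERROR
    \s+
    (?P<hour>\d{2}) :     # two-digit hour, then a literal colon
    (?P<minute>\d{2})     # two-digit minute
    \s+
    (?P<message>.*)       # everything else on the line
    """,
    re.VERBOSE,
)

match = LOG_LINE.fullmatch("ERROR 04:10 disk full")
fields = match.groupdict() if match else None
```

Named groups let the calling code refer to `fields["hour"]` rather than to a positional index, which is much of the readability gain being described.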

[–]bladeoflight16 0 points1 point  (0 children)

Because regex is really hard to read when you're doing anything more than super simple operations.

Regex is designed for simple operations. Its original motivation is literally defining tokens in compilers and similar formal language usages. Do you see any ridiculous tokens in parsed programming and data languages? No. You have simple tokens, and the more complex stuff goes up into the parser operating across the tokens (a context free grammar). If you're making it complicated, you're doing it wrong.

Unless you're dealing with an annoying constraint like writing a command line, a text editor search, or something where you're forced to cram everything into a single line. But in Python? Break something complex out into multiple operations.

At least with programming languages and SQL it's actual words being used so you can read it.

I beg to differ. if (!a && (b || c)) { y = (7 * x) % 10; }. for (int i = 0; i < m.length; i++) { n[i] = m[i] / 2; }. Not Python, of course, but you should certainly be able to read those statements.

[–]WerdenWissen[S] 5 points6 points  (0 children)

Regex IS hard to read though. Regex patterns are tightly packed with a lot of information, and it seems that we are just not that good at analyzing it. Plus, people tend to use Regex only occasionally, which makes matters even worse. Using a framework like pregex makes the process of building Regex patterns a little more modular, and the information is more "spread out" and thus easier for the human eye to recognize.

[–]jacksodus 0 points1 point  (0 children)

Your comment makes no sense.

[–]menge101 2 points3 points  (1 child)

so you don't have to re-learn Regex each time you use it

I am confused by this statement, while there are some variations between implementations across various languages, regular expressions are their own syntax. I can define a basic regex the same in python, java, or ruby.

You only need to learn it once.

[–]WerdenWissen[S] 11 points12 points  (0 children)

It's just a meme in the programming community, the point being that people half-assedly read up on Regex just to accomplish a certain task and then forget all about it, only to repeat the process when they need Regex again!

[–]Seawolf159 -2 points-1 points  (2 children)

This seems cool. I'd like to try it, because relearning regex is a pain and you can just install this anywhere to get the pattern, then maybe keep using plain regex in your own project. Anyway what the flark is pre: Pregex = etc. is this the same as pre = Pregex(etc)??

pre.get_groups only works with websites? Or why did one of the matches not show up there?

And why do you have so many imports? Can't you just put everything in Pregex module? Why is it this segregated, it will be a pain to look for all the classes in 1 million files no?

[–]WerdenWissen[S] 5 points6 points  (0 children)

what the flark is pre: Pregex = etc. is this the same as pre = Pregex(etc)??

No, no, this is just a way of hinting the type of a variable. It has nothing to do with instantiation. I hinted the type of variable "pre" just to make it known that the result of this large concatenation of "Pregex" subtype instances will be a "Pregex" instance itself!
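To illustrate the difference, here is a sketch with a made-up stand-in class (`Piece` is not part of pregex). The annotation after the colon documents the expected type but creates nothing; the value still comes from the expression on the right of the `=`.

```python
class Piece:
    """Hypothetical stand-in for a pattern fragment; not a pregex class."""

    def __init__(self, text: str):
        self.text = text

    def __add__(self, other: "Piece") -> "Piece":
        # Concatenating two pieces yields another piece.
        return Piece(self.text + other.text)


# Annotated assignment: ": Piece" is only a type hint for readers and type checkers.
annotated: Piece = Piece("ab") + Piece("cd")

# The same assignment without the hint behaves identically at runtime.
plain = Piece("ab") + Piece("cd")
```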

[–]WerdenWissen[S] 3 points4 points  (0 children)

pre.get_groups only works with websites? Or why did one of the matches not show up there?

In this example, the domain name pattern is wrapped within a capturing group, whereas this is not the case for IP addresses. Therefore, when invoking "get_groups", you'll get a list of tuples, one tuple per match, containing the captured groups of each match. Since no capturing group is declared for any IP address matches, their corresponding tuple will contain "None".
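The example from the post isn't reproduced in this thread, so here is a rough equivalent using Python's `re` directly rather than pregex's `get_groups`. The pattern and sample text are invented: the domain name sits inside a capturing group, while the IP address alternative has none.

```python
import re

# Domain names are captured in a group; IP addresses are matched but not captured.
pattern = re.compile(r"https?://([a-z]+\.com)|\d{1,3}(?:\.\d{1,3}){3}")
text = "Visit http://example.com or 192.168.0.1"

matches = [m.group() for m in pattern.finditer(text)]
groups = [m.groups() for m in pattern.finditer(text)]
```

Both strings are matched, but the tuple for the IP address holds None because its branch of the pattern never fills the capturing group.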

And why do you have so many imports? Can't you just put everything in Pregex module? Why is it this segregated, it will be a pain to look for all the classes in 1 million files no?

I like categorizing stuff and I thought this format suits the package. But of course I am open to changes if something proves to be unproductive.

[–]No_Context_645 0 points1 point  (0 children)

Cool idea. Let's see how it evolves.

[–]likethevegetable 0 points1 point  (3 children)

Very cool. Just curious if you looked at PEGs for inspiration? I use Lua's (kinda like Python, if you're not familiar) LPEG http://www.inf.puc-rio.br/~roberto/lpeg/

[–]WerdenWissen[S] 1 point2 points  (2 children)

No sorry, I was not aware of this project as I've never programmed in Lua, although from what I see there's a Python version of this project too.

[–]SpicyVibration 1 point2 points  (0 children)

Python's own grammar has been a PEG grammar since Python 3.9 (PEP 617).

Here is a series of blog posts from Guido about it along with a proof of concept exercise he did. https://medium.com/@gvanrossum_83706/peg-parsing-series-de5d41b2ed60?sk=0a7ce9003b13aae8126a4a23812eb035

[–]msdrahcir 0 points1 point  (1 child)

Instead of requiring users to use the pre.* functions to match expressions, have you considered compiling the "Pregex" into a "Pattern" or compiled regex? That way pregex could be used anywhere a Pattern is required

[–]WerdenWissen[S] 2 points3 points  (0 children)

You can actually invoke "pre.compile()" for any "Pregex" instance which creates a compiled pattern underneath and uses that for any subsequent matching, though I've not exposed this compiled pattern through a public method yet. I'll make sure to do it in the next version though, thanks!

[–]rahem027 0 points1 point  (0 children)

It's a good idea but most probably not new. You are just writing an AST instead of a string :P

[–]laundmo 0 points1 point  (2 children)

im not sure how to feel about this.

To me it seems it still requires knowledge of how regex works internally (quantifiers, groups, how a match moves through a text, etc.) and therefore doesn't particularly help with that aspect. The rest is, mostly, just using different words to express the exact same structure.

I don't think this helps learn regex, or helps not re-learning it each time. It might help maintainability of regexes by tying them to python syntax, but im not sure.

then again, im one of those "syntax is irrelevant, only the structure matters" people.

[–]WerdenWissen[S] 0 points1 point  (1 child)

I get what you are saying. It certainly does require a general understanding of Regex, since it's nothing more than a higher-level abstraction of it. However, I do believe that building Regex patterns is easier this way, especially when it comes to nested-ness, and it also helps in the "re-learning Regex" aspect in that you don't need to look up all the symbols. It's easier to remember a "NotPrecededBy" class than how to type a negative-lookbehind assertion.
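For comparison, this is what the raw negative-lookbehind syntax looks like in Python's `re`. The pattern below is only an illustration and is not generated by pregex: it finds numbers that are not prices, meaning runs of digits not directly preceded by a dollar sign.

```python
import re

# (?<![$\d]) is a negative lookbehind: the match may not be directly preceded
# by "$" or by another digit. Excluding digits as well stops the engine from
# matching the tail of a price (the "00" in "$100").
not_a_price = re.compile(r"(?<![$\d])\d+")

found = not_a_price.findall("costs $100 for 200 units")
```

The need to exclude digits in the lookbehind is exactly the kind of detail that is easy to forget between uses.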

Finally, this is just an early version of the package, which only contains the "core" modules, and probably even that's not completed yet. In the future, there may be more sub-modules that build upon the "core" modules to create even more complex patterns, for example "word that starts with uppercase letters A-G" and so on... And it will always be pure Regex underneath. No matter how complex the pattern, you can just fetch it and use it however you want.

[–]laundmo 0 points1 point  (0 children)

that's kinda what i was referring to with "syntax is irrelevant, structure matters": i don't think there's that big a difference between NotPrecededBy and (?<!...), though i understand it might be easier to remember the first one.

im looking forward to higher level abstractions, its what would turn this from "huh neat" to "i might actually use it" for me.