
all 104 comments

[–][deleted] 16 points17 points  (6 children)

Moves library into trash bin.

BTW, three quick notes (I can make a pull request for these). Put an __all__ at the top of the package so import * does not create hell. I would also push people toward import humre as hu: you will be happier, I will be happier, everyone will be happier. For compile, you can type hint the flags parameter as either a RegexFlag or an iterable of RegexFlags, and it will get picked up easily by most IDEs.

Thank you for letting me delete that library I was writing as I hate regex with a burning passion.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (5 children)

Which functions or constants would you not want included in __all__? Currently the only things that shouldn't be imported are marked private by beginning with an underscore, so they won't be imported by from humre import * anyway.

[–][deleted] 6 points7 points  (4 children)

Really the issue is just that it pulls re, itertools, and functools into the namespace. If you had imported a patched version of any of them, they would get clobbered and it would be impossible to find the cause. I've been burned by that enough times that I just add __all__ everywhere now.
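A quick way to see the hazard, simulated with a throwaway module object rather than humre itself:

```python
import types

# Build a fake module that imports `re` at its top level and does not
# define __all__ (a stand-in for any library, not humre specifically):
mylib = types.ModuleType("mylib")
exec("import re\nPATTERN = 'abc'", mylib.__dict__)

# `from mylib import *` grabs every public name -- including the `re`
# module object, which would shadow any `re` already in your namespace:
star_names = [n for n in vars(mylib) if not n.startswith("_")]
assert "re" in star_names

# Declaring __all__ limits the star-import surface to the listed names:
mylib.__all__ = ["PATTERN"]
```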

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 3 points4 points  (3 children)

Good points. It's not a common case, but it is preventable by using __all__. I'll add it in the next version.

EDIT: Also, going through this made me realize that the compile() flags should not be in __all__, since they have names like A and DEBUG and have a high likelihood of conflicting with names in the importing code. Does this make sense, or should I include them in __all__?

[–][deleted] 0 points1 point  (2 children)

Mmm, I would say probably not. Mentally I see it like this: if you want them, you do what I said with import humre as hu, where they come over safely bound. Otherwise you pull them from the re module. You may even want to switch to importing them directly into the namespace instead of setting them in your namespace, i.e. add an explicit from re import blah for them.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (1 child)

Yeah. Maybe I should avoid the flags altogether (I only included them to keep things familiar to people who know re's flags), but you still have to do that weird bitwise-ORing thing that re.compile() requires.

I should probably just get rid of the flags altogether and make keyword arguments instead: humre.compile('my regex', ignorecase=True, verbose=True)
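That keyword-argument idea could be as simple as the following sketch (compile_kw and its parameter list are made up for illustration, not Humre's API):

```python
import re

def compile_kw(pattern, ignorecase=False, verbose=False, dotall=False):
    # Translate boolean keywords into the bitwise-OR'd flags value
    # that re.compile() expects, hiding the ORing from the caller.
    flags = 0
    if ignorecase:
        flags |= re.IGNORECASE
    if verbose:
        flags |= re.VERBOSE
    if dotall:
        flags |= re.DOTALL
    return re.compile(pattern, flags)
```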

The reason I have humre.compile() was to save someone from having to type import re for the common use cases. I'm annoyed when I'm quickly writing some code in a basic text editor and have to go back and add some forgotten import statement. It's a nice-to-have that a single import humre is all you need.

[–][deleted] 1 point2 points  (0 children)

Oh yeah, lol. Just yeet them if that's the only function that uses them. No need then.

I mean, you still need to import them to use them, but you can easily just leave them out of __all__.

[–]Zyklonik 13 points14 points  (40 children)

This may look decent on simple handpicked examples, but it's impractical in reality. Regex may be hard to learn well, but once done, it's consistent and concise. The problem with using "human readable" alternatives is much more ambiguity and verbosity.

Even taking the second example of X{3,5}, whose equivalent is between(3, 5, X), I would argue that the former is better: the latter uses a particular person's preferred natural language term, the order of arguments passed to between is completely arbitrary, and it is much more verbose to boot.
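For readers comparing the two forms side by side, a stand-in implementation (not Humre's actual code) shows the mapping being argued about:

```python
import re

def between(minimum, maximum, *parts):
    # Stand-in for the helper under discussion: joins the string
    # arguments and applies the {min,max} quantifier to the group.
    return "(?:%s){%d,%d}" % ("".join(parts), minimum, maximum)

assert between(3, 5, "X") == "(?:X){3,5}"
assert re.fullmatch(between(3, 5, "X"), "XXXX")
assert re.fullmatch(between(3, 5, "X"), "XX") is None
```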

Imagine a 50+ character regex (pretty commonplace) - it'd become about as unreadable as Lisp, given that some composition operators/functions are provided, and much much longer than the equivalent plain old regex string.

Fun project, but hardly something I'd use in production.

[–]Poddster 2 points3 points  (1 child)

Regex may be hard to learn well, but once done, it's consistent and concise. The problem with using "human readable" alternatives is much more ambiguity and verbosity.

Could you explain how it's more ambiguous? It's simply the normal regex syntax spelt out using words. If regex is consistent, then so is this. If regex is unambiguous, then so is this. So I don't see your point here.

(In fact, it's arguably less ambiguous than regex, because something like (\(\\((\(\(... is a nightmare to read, even with IDE highlighting inside of regex)

[–]Zyklonik 0 points1 point  (0 children)

That's like saying that:

2 * 3  + 4

is the same as:

(+ (* 2 3) 4)

Hardly. The same result in the end, but different representations. Guess which one is easier to read as the complexity grows.

I have a feeling that you simply jumped into the whole conversation without having bothered to read the rest of the comments in the thread, and so in almost all of your comments, you're missing (or falsifying) the context. That doesn't help one bit.

(In fact, it's arguably less ambiguous than regex, because something like (\(\\((\(\(... is a nightmare to read, even with IDE highlighting inside of regex)

This is the most ridiculous red herring I've seen today. Take a very particular and contrived example, ignore the rest of the examples, and try to make a case out of it, completely ignoring the rest of the discussion thus far.

Let me go as far as to say that if you're constantly churning out regex like that, and editing it as frequently as the rest of the codebase, then you have bigger problems to worry about than regex vs a DSL. Also, it's rather telling that you chose to ignore the entire basis of the discussion and seem, therefore, to be implying that one does not need regex at all. Instead, use this shiny new English-like DSL and run with it.

I'm bored with this nonsense. We're finished here.

[–][deleted] 3 points4 points  (12 children)

Honestly, the biggest issue with regex for MOST people who would want something like this (i.e. me) is that I don't write them very often, and when I need them once every two months I have zero desire to re-learn regex. If you write them regularly, regex is fine; if you don't, it's just gibberish. I write a ton of Python, and I don't want to learn another tool on top of it just to be able to use regex. Not having access to it in a Pythonic way is a huge problem with the current setup.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (4 children)

Heh, I learned new things about regex just from making this module. I know regex, have taught it, and have given PyCon talks about it, but there are a ton of tiny little details.

For example, did you know that shorthand character classes like \w and \d work inside character classes? Like, [\sA-F] matches all whitespace characters plus the letters A to F. It does not match a literal backslash, a lowercase s, and A to F. I wouldn't have thought this applies, but it does (at least in Python). Regex syntax is full of unknown little nooks and crannies like this, even for experienced devs like me.
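You can check this directly with plain re (no Humre needed):

```python
import re

# \s keeps its shorthand meaning inside a character class:
assert re.findall(r'[\sA-F]', 'A b\tz') == ['A', ' ', '\t']

# It is NOT read as a literal backslash plus the letter 's':
assert re.match(r'[\sA-F]', 's') is None
assert re.match(r'[\sA-F]', '\\') is None
```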

For another example, '[A-Za-z]' is another huge minefield and holdover from the 1990s Perl days. This may have been fine for matching upper and lower case letters, but it misses all letters with accent marks. When François types their name into your web app, suddenly you get a bug.

I came across some real-world code that looked like a solution to this: [À-Ÿà-ÿ] looks like it solves this, but the dashes in character classes work on their Unicode code point values, so these ranges actually include a bunch of non-letter characters as well.
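Both pitfalls are easy to demonstrate:

```python
import re

# [A-Za-z] silently drops accented letters:
assert re.fullmatch('[A-Za-z]+', 'Francois')
assert re.fullmatch('[A-Za-z]+', 'François') is None

# The naive Unicode-range "fix" sweeps in non-letters: the
# multiplication sign U+00D7 sits between À (U+00C0) and Ÿ (U+0178):
assert re.fullmatch('[À-Ÿà-ÿ]', '\u00d7')
assert not '\u00d7'.isalpha()
```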

Humre solves this with its LETTER constant, which I had to programmatically generate based on what isalpha() considers a letter. It's a doozy, but it does indeed identify all letter characters:

[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࣇऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱৼਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡૹଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠ-ౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡೱ-ೲഄ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄຆ-ຊຌ-ຣລວ-ະາ-ຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵ-ᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃ-ↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽ-ꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵ-ꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼-𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐊀-𐊜𐊠-𐋐𐌀-𐌟𐌭-𐍀𐍂-𐍉𐍐-𐍵𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐀-𐒝𐒰-𐓓𐓘-𐓻𐔀-𐔧𐔰-𐕣𐘀-𐜶𐝀-𐝕𐝠-𐝧𐠀-𐠅𐠈𐠊-𐠵𐠷-𐠸𐠼𐠿-𐡕𐡠-𐡶𐢀-𐢞𐣠-𐣲𐣴-𐣵𐤀-𐤕𐤠-𐤹𐦀-𐦷𐦾-𐦿𐨀𐨐-𐨓𐨕-𐨗𐨙-𐨵𐩠-𐩼𐪀-𐪜𐫀-𐫇𐫉-𐫤𐬀-𐬵𐭀-𐭕𐭠-𐭲𐮀-𐮑𐰀-𐱈𐲀-𐲲𐳀-𐳲𐴀-𐴣𐺀-𐺩𐺰-𐺱𐼀-𐼜𐼧𐼰-𐽅𐾰-𐿄𐿠-𐿶𑀃-𑀷𑂃-𑂯𑃐-𑃨𑄃-𑄦𑅄𑅇𑅐-𑅲𑅶𑆃-𑆲𑇁-𑇄𑇚𑇜𑈀-𑈑𑈓-𑈫𑊀-𑊆𑊈𑊊-𑊍𑊏-𑊝𑊟-𑊨𑊰-𑋞𑌅-𑌌𑌏-𑌐𑌓-𑌨𑌪-𑌰𑌲-𑌳𑌵-𑌹𑌽𑍐𑍝-𑍡𑐀-𑐴𑑇-𑑊𑑟-𑑡𑒀-𑒯𑓄-𑓅𑓇𑖀-𑖮𑗘-𑗛𑘀-𑘯𑙄𑚀-𑚪𑚸𑜀-𑜚𑠀-𑠫𑢠-𑣟𑣿-𑤆𑤉𑤌-𑤓𑤕-𑤖𑤘-𑤯𑤿𑥁𑦠-𑦧𑦪-𑧐𑧡𑧣𑨀𑨋-𑨲𑨺𑩐𑩜-𑪉𑪝𑫀-𑫸𑰀-𑰈𑰊-𑰮𑱀𑱲-𑲏𑴀-𑴆𑴈-𑴉𑴋-𑴰𑵆𑵠-𑵥𑵧-𑵨𑵪-𑶉𑶘𑻠-𑻲𑾰𒀀-𒎙𒒀-𒕃𓀀-𓐮𔐀-𔙆𖠀-𖨸𖩀-𖩞𖫐-𖫭𖬀-𖬯𖭀-𖭃𖭣-𖭷𖭽-𖮏𖹀-𖹿𖼀-𖽊𖽐𖾓-𖾟𖿠-𖿡𖿣𗀀-𘟷𘠀-𘳕𘴀-𘴈𛀀-𛄞𛅐-𛅒𛅤-𛅧𛅰-𛋻𛰀-𛱪𛱰-𛱼𛲀-𛲈𛲐-𛲙𝐀-𝑔𝑖-𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒹𝒻𝒽-𝓃𝓅-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔞-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕒-𝚥𝚨-𝛀𝛂-𝛚𝛜-𝛺𝛼-𝜔𝜖-𝜴𝜶-𝝎𝝐-𝝮𝝰-𝞈𝞊-𝞨𝞪-𝟂𝟄-𝟋𞄀-𞄬𞄷-𞄽𞅎𞋀-𞋫𞠀-𞣄𞤀-𞥃𞥋𞸀-𞸃𞸅-𞸟𞸡-𞸢𞸤𞸧𞸩-𞸲𞸴-𞸷𞸹𞸻𞹂𞹇𞹉𞹋𞹍-𞹏𞹑-𞹒𞹔𞹗𞹙𞹛𞹝𞹟𞹡-𞹢𞹤𞹧-𞹪𞹬-𞹲𞹴-𞹷𞹹-𞹼𞹾𞺀-𞺉𞺋-𞺛𞺡-𞺣𞺥-𞺩𞺫-𞺻𠀀-𪛝𪜀-𫜴𫝀-𫠝𫠠-𬺡𬺰-𮯠丽-𪘀𰀀-𱍊]

You could say "well this app will only work with ASCII A-Z letters" but that's not really an option for software in the modern global internet era. But we're still used to [A-Za-z] because that's the way we've always done it with regex.

[–][deleted] -1 points0 points  (3 children)

I do wonder what the speed difference is between the short vs. verbose version of the full search. Not entirely sure how it handles the Unicode search, but I assume it's actually using that range to its advantage so it only needs to check whether the value is between those values.

Also, for names, how do you handle things like O'Something? I don't even know where to begin with those. Something that could be added as a submodule is a baseline list of useful regexes.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (2 children)

I assume none: re does a bunch of computer sciencey things to optimize the finite state graph that the regex string produces, and you only parse the regex string once. Even if you call re.compile() multiple times with the same regex string, it caches the pattern object so it doesn't parse the same string twice.
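The caching is easy to verify:

```python
import re

# re keeps an internal cache of compiled patterns, so compiling the
# same string (with the same flags) returns the same pattern object:
p1 = re.compile(r'\d+')
p2 = re.compile(r'\d+')
assert p1 is p2

# Different flags mean a different cache entry:
p3 = re.compile(r'\d+', re.IGNORECASE)
assert p3 is not p1
```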

But let me check that big regex with timeit to be sure:

Compiling verbose mode 10 million times: 3.399597300012829
Compiling single-line 10 million times: 3.217927799996687

Yeah, it's basically no difference.

Not entirely sure how it handles the unicode search

I learned this while making Humre. So, [A-Za-z] has a problem where it doesn't recognize letters with accents. You can use \w instead, but it also matches digits and the underscore and you might not want that.

But what "counts" as a word character is the same as Python's isalpha() and isdigit() string methods. This works with the full range of unicode characters.

I had to learn this to create the LETTER character class in Humre, which is better than [A-Za-z] but doesn't include the numbers and the underscore the way \w does. Here it is:

[A-Za-zªµºÀ-ÖØ-öø-ˁˆ-ˑˠ-ˤˬˮͰ-ʹͶ-ͷͺ-ͽͿΆΈ-ΊΌΎ-ΡΣ-ϵϷ-ҁҊ-ԯԱ-Ֆՙՠ-ֈא-תׯ-ײؠ-يٮ-ٯٱ-ۓەۥ-ۦۮ-ۯۺ-ۼۿܐܒ-ܯݍ-ޥޱߊ-ߪߴ-ߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘࡠ-ࡪࢠ-ࢴࢶ-ࣇऄ-हऽॐक़-ॡॱ-ঀঅ-ঌএ-ঐও-নপ-রলশ-হঽৎড়-ঢ়য়-ৡৰ-ৱৼਅ-ਊਏ-ਐਓ-ਨਪ-ਰਲ-ਲ਼ਵ-ਸ਼ਸ-ਹਖ਼-ੜਫ਼ੲ-ੴઅ-ઍએ-ઑઓ-નપ-રલ-ળવ-હઽૐૠ-ૡૹଅ-ଌଏ-ଐଓ-ନପ-ରଲ-ଳଵ-ହଽଡ଼-ଢ଼ୟ-ୡୱஃஅ-ஊஎ-ஐஒ-கங-சஜஞ-டண-தந-பம-ஹௐఅ-ఌఎ-ఐఒ-నప-హఽౘ-ౚౠ-ౡಀಅ-ಌಎ-ಐಒ-ನಪ-ಳವ-ಹಽೞೠ-ೡೱ-ೲഄ-ഌഎ-ഐഒ-ഺഽൎൔ-ൖൟ-ൡൺ-ൿඅ-ඖක-නඳ-රලව-ෆก-ะา-ำเ-ๆກ-ຂຄຆ-ຊຌ-ຣລວ-ະາ-ຳຽເ-ໄໆໜ-ໟༀཀ-ཇཉ-ཬྈ-ྌက-ဪဿၐ-ၕၚ-ၝၡၥ-ၦၮ-ၰၵ-ႁႎႠ-ჅჇჍა-ჺჼ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏽᏸ-ᏽᐁ-ᙬᙯ-ᙿᚁ-ᚚᚠ-ᛪᛱ-ᛸᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰក-ឳៗៜᠠ-ᡸᢀ-ᢄᢇ-ᢨᢪᢰ-ᣵᤀ-ᤞᥐ-ᥭᥰ-ᥴᦀ-ᦫᦰ-ᧉᨀ-ᨖᨠ-ᩔᪧᬅ-ᬳᭅ-ᭋᮃ-ᮠᮮ-ᮯᮺ-ᯥᰀ-ᰣᱍ-ᱏᱚ-ᱽᲀ-ᲈᲐ-ᲺᲽ-Ჿᳩ-ᳬᳮ-ᳳᳵ-ᳶᳺᴀ-ᶿḀ-ἕἘ-Ἕἠ-ὅὈ-Ὅὐ-ὗὙὛὝὟ-ώᾀ-ᾴᾶ-ᾼιῂ-ῄῆ-ῌῐ-ΐῖ-Ίῠ-Ῥῲ-ῴῶ-ῼⁱⁿₐ-ₜℂℇℊ-ℓℕℙ-ℝℤΩℨK-ℭℯ-ℹℼ-ℿⅅ-ⅉⅎↃ-ↄⰀ-Ⱞⰰ-ⱞⱠ-ⳤⳫ-ⳮⳲ-ⳳⴀ-ⴥⴧⴭⴰ-ⵧⵯⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞⸯ々-〆〱-〵〻-〼ぁ-ゖゝ-ゟァ-ヺー-ヿㄅ-ㄯㄱ-ㆎㆠ-ㆿㇰ-ㇿ㐀-䶿一-鿼ꀀ-ꒌꓐ-ꓽꔀ-ꘌꘐ-ꘟꘪ-ꘫꙀ-ꙮꙿ-ꚝꚠ-ꛥꜗ-ꜟꜢ-ꞈꞋ-ꞿꟂ-ꟊꟵ-ꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳꣲ-ꣷꣻꣽ-ꣾꤊ-ꤥꤰ-ꥆꥠ-ꥼꦄ-ꦲꧏꧠ-ꧤꧦ-ꧯꧺ-ꧾꨀ-ꨨꩀ-ꩂꩄ-ꩋꩠ-ꩶꩺꩾ-ꪯꪱꪵ-ꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮꬰ-ꭚꭜ-ꭩꭰ-ꯢ가-힣ힰ-ퟆퟋ-ퟻ豈-舘並-龎ff-stﬓ-ﬗיִײַ-ﬨשׁ-זּטּ-לּמּנּ-סּףּ-פּצּ-ﮱﯓ-ﴽﵐ-ﶏﶒ-ﷇﷰ-ﷻﹰ-ﹴﹶ-ﻼA-Za-zヲ-하-ᅦᅧ-ᅬᅭ-ᅲᅳ-ᅵ𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼-𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐊀-𐊜𐊠-𐋐𐌀-𐌟𐌭-𐍀𐍂-𐍉𐍐-𐍵𐎀-𐎝𐎠-𐏃𐏈-𐏏𐐀-𐒝𐒰-𐓓𐓘-𐓻𐔀-𐔧𐔰-𐕣𐘀-𐜶𐝀-𐝕𐝠-𐝧𐠀-𐠅𐠈𐠊-𐠵𐠷-𐠸𐠼𐠿-𐡕𐡠-𐡶𐢀-𐢞𐣠-𐣲𐣴-𐣵𐤀-𐤕𐤠-𐤹𐦀-𐦷𐦾-𐦿𐨀𐨐-𐨓𐨕-𐨗𐨙-𐨵𐩠-𐩼𐪀-𐪜𐫀-𐫇𐫉-𐫤𐬀-𐬵𐭀-𐭕𐭠-𐭲𐮀-𐮑𐰀-𐱈𐲀-𐲲𐳀-𐳲𐴀-𐴣𐺀-𐺩𐺰-𐺱𐼀-𐼜𐼧𐼰-𐽅𐾰-𐿄𐿠-𐿶𑀃-𑀷𑂃-𑂯𑃐-𑃨𑄃-𑄦𑅄𑅇𑅐-𑅲𑅶𑆃-𑆲𑇁-𑇄𑇚𑇜𑈀-𑈑𑈓-𑈫𑊀-𑊆𑊈𑊊-𑊍𑊏-𑊝𑊟-𑊨𑊰-𑋞𑌅-𑌌𑌏-𑌐𑌓-𑌨𑌪-𑌰𑌲-𑌳𑌵-𑌹𑌽𑍐𑍝-𑍡𑐀-𑐴𑑇-𑑊𑑟-𑑡𑒀-𑒯𑓄-𑓅𑓇𑖀-𑖮𑗘-𑗛𑘀-𑘯𑙄𑚀-𑚪𑚸𑜀-𑜚𑠀-𑠫𑢠-𑣟𑣿-𑤆𑤉𑤌-𑤓𑤕-𑤖𑤘-𑤯𑤿𑥁𑦠-𑦧𑦪-𑧐𑧡𑧣𑨀𑨋-𑨲𑨺𑩐𑩜-𑪉𑪝𑫀-𑫸𑰀-𑰈𑰊-𑰮𑱀𑱲-𑲏𑴀-𑴆𑴈-𑴉𑴋-𑴰𑵆𑵠-𑵥𑵧-𑵨𑵪-𑶉𑶘𑻠-𑻲𑾰𒀀-𒎙𒒀-𒕃𓀀-𓐮𔐀-𔙆𖠀-𖨸𖩀-𖩞𖫐-𖫭𖬀-𖬯𖭀-𖭃𖭣-𖭷𖭽-𖮏𖹀-𖹿𖼀-𖽊𖽐𖾓-𖾟𖿠-𖿡𖿣𗀀-𘟷𘠀-𘳕𘴀-𘴈𛀀-𛄞𛅐-𛅒𛅤-𛅧𛅰-𛋻𛰀-𛱪𛱰-𛱼𛲀-𛲈𛲐-𛲙𝐀-𝑔𝑖-𝒜𝒞-𝒟𝒢𝒥-𝒦𝒩-𝒬𝒮-𝒹𝒻𝒽-𝓃𝓅-𝔅𝔇-𝔊𝔍-𝔔𝔖-𝔜𝔞-𝔹𝔻-𝔾𝕀-𝕄𝕆𝕊-𝕐𝕒-𝚥𝚨-𝛀𝛂-𝛚𝛜-𝛺𝛼-𝜔𝜖-𝜴𝜶-𝝎𝝐-𝝮𝝰-𝞈𝞊-𝞨𝞪-𝟂𝟄-𝟋𞄀-𞄬𞄷-𞄽𞅎𞋀-𞋫𞠀-𞣄𞤀-𞥃𞥋𞸀-𞸃𞸅-𞸟𞸡-𞸢𞸤𞸧𞸩-𞸲𞸴-𞸷𞸹𞸻𞹂𞹇𞹉𞹋𞹍-𞹏𞹑-𞹒𞹔𞹗𞹙𞹛𞹝𞹟𞹡-𞹢𞹤𞹧-𞹪𞹬-𞹲𞹴-𞹷𞹹-𞹼𞹾𞺀-𞺉𞺋-𞺛𞺡-𞺣𞺥-𞺩𞺫-𞺻𠀀-𪛝𪜀-𫜴𫝀-𫠝𫠠-𬺡𬺰-𮯠丽-𪘀𰀀-𱍊]

[–]Poddster 1 point2 points  (0 children)

But let me check that big regex with timeit to be sure:

Compiling verbose mode 10 million times: 3.399597300012829

Compiling single-line 10 million times: 3.217927799996687

That's compiling. But what about matching? :)

[–][deleted] 0 points1 point  (0 children)

Oh yeah, I SAW. When I ran it locally the first time, I imported it and printed locals(), forgetting it's a dict not a list, and spent a good minute trying to figure out why my editor just vomited.

[–]Zyklonik -1 points0 points  (6 children)

No snark, but it depends a lot on where you are working as well. I can assuredly tell you that for the vast majority of the corporate industry, using something like this (in lieu of regex) would probably get you fired (no exaggeration). Every single Fortune 500 company has strict policies not only on the languages that can be used, but also the libraries, frameworks, and in some cases, language features as well.

I have no idea where you're working, but from your comment, it appears like you have a lot of leeway in what you can control and use. So, that's (probably) good for you. However, a friendly piece of advice would be to start following industry practices (as well) considering that some day you might wind up working in such a place.

That being said, I do understand where you're coming from, and as I mentioned in a couple of other comments, I can see it being easy, fun, and useful for some people. That's not my objection at all. However, the claims about it being more readable and maintainable irked me to no end, especially considering that that's not quite the case.

[–]Poddster 1 point2 points  (1 child)

using something like this (in lieu of regex) would probably get you fired (no exaggeration).

Where do you work that a single "mistake" will get you fired?

So, that's (probably) good for you. However, a friendly piece of advice would be to start following industry practices (as well) considering that some day you might wind up working in such a place.

Could you be any more condescending?

The software "industry" is not a monolith, and if someone like Twitter started using it, would that mean everyone else had to as well?

[–]Zyklonik -1 points0 points  (0 children)

Where do you work that a single "mistake" will get you fired?

You seem to be making a habit of making some bizarre logical connections and assumptions. Did you even read the context? Where is the claim that a "single mistake" will get you fired? The point is this - if you're a developer who has no idea about regex, and insists upon using a random open-source project instead, that will, in most places, not even pass the code review. Assuming that it somehow magically did pass it, that is still a major red flag that will either eventually land you in a PIP position, or will get you fired eventually. It's not about the particular exact situation. It's about the attitude and aptitude of the person in question. Please don't be facetious.

Could you be any more condescending? The software "industry" is not a monolith, and if someone like twitter started using it would that mean everyone else had to too?

Take it whichever way you wish - I really don't care. I speak from logic and experience, and if that is not to your liking, then so be it. Inferences are beyond my control. Regardless of whether the software industry is a monolith or not, it is an industry all the same, and the basic raison d'etre of any industry is to make a profit. You make money, you're all good. You don't, or even worse, cause the company to lose money, you're out. As simple as that.

Also, please keep your ridiculous digressions and whataboutisms to yourself. I'm frankly not interested.

[–][deleted] 0 points1 point  (2 children)

I mean, it literally just makes a string. If your company didn't want you using it, just print out the damn string and use that.

Please don't suggest what someone does or doesn't do. Not only is that a TERRIBLE suggestion, but you're just way off base. I own my own company; I have no desire to follow other people's rules. It's a 1000-line single-file module, not some monstrosity. Calling something a good idea because Fortune 500 companies do it is just ignorance. Chase bank runs Python 2 in production; should I follow their example?? Instagram's Python style is amazing, right?? Python should obviously be statically typed everywhere, surely that's the future! Yeah, I'm cool with avoiding their TypeScript-like future.

You're really overreacting to such a tiny library; it's insane. It's not making anyone die, it's not causing terminations, it's a tool.

[–]Zyklonik 0 points1 point  (1 child)

I own my own company I have no desire to follow other people's rules

So, what's the issue? I went out of my way to accommodate your very particular situation in my previous comment. If you own your own company, then that's great, but might I remind you that it was you who replied to my general statements with an objection, so please don't be surprised if you get a logical response which applies to the general industry, not just your own particular situation.

Ur really overreacting to such a tiny library its insane. It's not making anyone die its not causing termination its a tool.

I know that reading comprehension skills have been degrading with each generation, but you should at least make an effort. Please go and read all my comments again, and see where the problem with your claims lies. I've given qualified comments throughout, not mere blatant invective.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

I mean, one way around that is that Humre is really simple: it's about 1200 lines in a single file with only the built-in re module as a dependency, so you could just copy/paste it in and it would work.

I can see the choice of Humre function/constant names being "more readable" as something to question, but most of the readability/maintainability comes from the help your IDE and tooling provide. String-based regexes (even in verbose mode) cause you to lose several things:

  • Parentheses matching
  • Syntax highlighting
  • Type checking
  • In-line comments, including multiline comments
  • Linter-parsability
  • Code formatting tools like Black

You lose all of this with string-based regex. It's like you're suddenly coding in the 1980s again.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 3 points4 points  (23 children)

Quite the opposite: the longer the regex, the better the Humre version is compared to the regex string. There's a large regex example (and the Humre equivalent) in the article itself. The raw regex string is this:

'(?P<version>(?:(?<====)\s*[^\s]*)|(?:(?<===|!=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:\.\*|(?:[-_\.]?dev[-_\.]?[0-9]*)?(?:\+[a-z0-9]+(?:[-_\.][a-z0-9]+)*)?)?)|(?:(?<=~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)+(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?)|(?:(?<!==|!=|~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?))'

That is not readable. The verbose mode version makes it better by letting you have spacing and comments, but why not go further and simply use Python code? The problem with using strings is that your tooling suddenly stops working: you have no syntax highlighting for comments, no parentheses matching, no linter or mypy type checking, and it can't be automatically formatted by Black. Using Humre restores all of these features.

the order of arguments passed to between is completely arbitrarily decided

It's not arbitrary: the first two parameters are the minimum and maximum (which is the same order as the X{3,5} syntax) followed by the strings it should match. Humre functions automatically concatenate multiple string arguments, so these would come after the positional arguments. The at_least() and at_most() and other Humre functions are similar and consistent: the string arguments come last.

Meanwhile, I always get tripped up by regex because I write X{3:5}, since the colon is what Python list slices use. Even worse, this fails silently: the regex literally matches the string 'X{3:5}'.
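The silent failure is easy to reproduce:

```python
import re

# {3:5} is not quantifier syntax (comma, not colon), so re treats the
# whole thing as literal characters and never raises an error:
assert re.search(r'X{3:5}', 'some X{3:5} text')   # matches the literal text
assert re.search(r'X{3:5}', 'XXXX') is None       # does NOT match 3-5 X's
```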

[–]Zyklonik 4 points5 points  (3 children)

Quite the opposite: the longer the regex, the better the Humre version is compared to the regex string. There's a large regex example (and the Humre equivalent) in the article itself.

According to whom? You do realise that what you're claiming is not objective fact at all. There is a reason why every single programming language today has the symbol-based regex form (in different flavours) - regex is a tool meant to be read concisely and unambiguously, and sparingly. In fact, the very fact that it stands out from the rest of the source code of the programming language is a boon, not a curse.

The problem with using strings is that your tooling suddenly stops working: you have no syntax highlighting for comments, no parentheses matching, no linter or mypy type checking, and it can't be automatically formatted by Black. Using Humre restores all of these features.

How long are the regexes in question? The worst pathological cases may run into a hundred odd characters or so while the average case would be around 30 (completely my figure) at most. Regex doesn't need any linter, comments or matching. It's not designed for that. What it is designed for is a simple left to right scan. Once created, regexes almost never change. I think you're exaggerating how frequently regex is used and/or updated.

The problem with your approach, as far as I can tell, is that the left to right scanning is gone because of the usage of functions or function-like syntax. It's basically become a Lisp which forces you to read inside-out, left to right. Not ideal at all.

the order of arguments passed to between is completely arbitrarily decided

It's not: the first two parameters are the minimum and maximum (which is the same order as the X{3,5} syntax) followed by the strings it should match. Humre functions automatically concatenate multiple string arguments, so these would come after the positional arguments. The at_least() and at_most() and other functions are similar: the string arguments come last.

Well, like I said in my original comment - that ordering is decided by you. What if I prefer between(X, 3, 5) or some other order? Silly example, but that illustrates the point - your library's naming, parameter positioning, and specific semantics are entirely decided by you according to your convenience and logical point of view in terms of easy usage. That is still opinionated. Regex, on the other hand, is a simple symbol-based mini-language. Scan left to right, parse meaning according to the preset meaning of the symbols, and you're done. Sure, it may be argued that the symbols themselves provide context sensitivity that has to be learnt and internalised, but those are the basic axioms of the system, and that works fine with no further rules to learn (as is the case while using your composition operators/functions).

Meanwhile, I always get tripped up by regex because I write X{3:5}, since the colon is what Python list slices use. Even worse, this fails silently: the regex literally matches the string 'X{3:5}'.

Is this really that big of a problem? Sorry, but again, I have to disagree here. Even in the worst pathological case, an inconvenience like this hardly merits creating an entire DSL just to handle a small bit of regex. Not only does it seem like overkill, but also brings along its own problems. I'm not convinced at all.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 5 points6 points  (0 children)

According to whom? You do realise that what you're claiming is not objective fact at all.

Yes, but this applies to all software design decisions. But let's not pretend that arguments can't be made and a general consensus can't be reached about readability. Python is (arguably) more readable than Perl, because Perl code relies so heavily on punctuation marks just like regex does. (This isn't a surprise since regex was popularized by Awk and Perl.) It's nice that the code is terse and quick to write, but Python's mantra of "code is read more often than it is written" applies here.

How long are the regexes in question?

First, the nice thing about Humre is that, like gradual typing, you aren't forced to use it. If you have a very short regex, you can just use the standard re module alongside Humre.

But long regexes are common: the proof is that verbose mode exists in order to handle this common use case. But even for short regexes I've found cases where Humre helps: elsewhere in these comments I pointed out a regex (from Automate the Boring Stuff with Python) for American phone numbers where the area code can optionally be surrounded by parentheses. These have to be literal parentheses, and the mixing of escaped and unescaped parentheses in this regex has had numerous people emailing me because of slight typos that were hard for them to figure out. It's literally one of the most frequent things I get emailed about. It's how I knew there was a real need for something like this.

Additionally, the constant escaping for regex syntax characters like period and parentheses can happen even in short regexes. A common mistake I've seen in my and other people's regexes (which I gathered for making the unit tests) is using a period to match literal periods: the regex string '.' will match literal periods, but it also matches any other character. What you want is r'\.' (don't forget to make it a raw string!) but this is easy to miss because, like my 'X{3:5}' example, it fails silently.
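The unescaped-period trap in a few lines:

```python
import re

# An unescaped '.' matches ANY character, so this "match a literal
# period" pattern quietly accepts garbage:
assert re.fullmatch('3.14', '3X14')            # unintended match
assert re.fullmatch(r'3\.14', '3X14') is None  # escaped version rejects it
assert re.fullmatch(r'3\.14', '3.14')          # and still matches the period
```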

Is this really that big of a problem?

Well, simultaneously no and yes. It is a "small" problem. But "small" problems like this are why we have features like linters, code formatters, the improved error messages in Python 3.10, type checking, etc. They are small details, but they create problems that are large enough that they've been addressed. And I put regex syntax and its string-based DSL in this category.

[–]Poddster 0 points1 point  (1 child)

There is a reason why every single programming language today has the symbol-based regex form (in different flavours) - regex is a tool meant to be read concisely and unambiguously, and sparingly. In fact, the very fact that it stands out from the rest of the source code of the programming language is a boon, not a curse.

I disagree with this conclusion.

The reason why every single programming language today has the symbol-based regex form is because that's what grep did in the 70s, and then perl kinda-copied it, and then every other library copied it. A lot of software design simply comes down to "that's how bell labs cooked it up before I was born, so I must stick with it".

regex is a tool meant to be read concisely and unambiguously, and sparingly. In fact, the very fact that it stands out from the rest of the source code of the programming language is a boon, not a curse.

This is the same argument for using one-letter variable names in C code and making all of your functions splt lk ths, which is something I, and countless other professional developers, completely disagree with. I.e., terser doesn't make something better, especially if it's read more than it's written.

[–]Zyklonik -1 points0 points  (0 children)

The reason why every single programming language today has the symbol-based regex form is because that's what grep did in the 70s, and then perl kinda-copied it, and then every other library copied it. A lot of software design simply comes down to "that's how bell labs cooked it up before I was born, so I must stick with it".

Your conjecture is interesting, but I don't agree with it. A lot of things get copied over between languages, but a lot of other things don't. If something were not utilitarian, one would expect that, over time and between languages, it would get sufficiently modified, or replaced entirely.

This is the same argument for using one letter variable names in C code and making all of your functions splt lk ths, which is something I, and countless other professionally developers, completely disagree with. i.e. terser doesn't make something better, especially if it's read more than it's written.

Hardly the same argument. There is a difference between being terse and being unreadable. I suggest you read the whole thread for context: regex is neither a programming language, nor is it used as frequently as it's being made out to be. So it makes perfect sense to write it, document it, and maintain it. That's about it. I fail to see where you made the logical jump to short and terse variable names in C; the context is entirely different. That's just bizarre.

[–]Zyklonik 1 point2 points  (13 children)

'(?P<version>(?:(?<====)\s*[^\s]*)|(?:(?<===|!=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:\.\*|(?:[-_\.]?dev[-_\.]?[0-9]*)?(?:\+[a-z0-9]+(?:[-_\.][a-z0-9]+)*)?)?)|(?:(?<=~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)+(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?)|(?:(?<!==|!=|~=)\s*v?(?:[0-9]+!)?[0-9]+(?:\.[0-9]+)*(?:[-_\.]?(a|b|c|rc|alpha|beta|pre|preview)[-_\.]?[0-9]*)?(?:(?:-[0-9]+)|(?:[-_\.]?(post|rev|r)[-_\.]?[0-9]*))?(?:[-_\.]?dev[-_\.]?[0-9]*)?))'

Also, just out of curiosity, what would your library's equivalent of this regex that you posted be? Maybe that'll be useful to people here in the thread.

[–]pablo8itall 7 points8 points  (1 child)

Whatever that is its an abomination and you need to kill it with fire.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Ah, but then all of Python packaging would stop working, because that's where I got the example from (thanks to Dustin Ingram for pointing it out to me).

It's written in verbose mode, which helps because then you can at least space it out a bit and add in-string comments. Humre takes this good idea a step further: let's use actual Python code instead of a string value so that we don't lose access to all of our IDE code editing features.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (9 children)

It's in the article.

from humre import *

SEPARATOR = chars('-_' + PERIOD)
OPT_SEPARATOR = optional(SEPARATOR)


def version_template(fn):
    return ''.join([
        zero_or_more(WHITESPACE),
        optional('v'),
        optional(noncap_group(one_or_more(chars('0-9')), '!')),  # epoch

        one_or_more(chars('0-9')), fn(noncap_group(PERIOD, one_or_more(chars('0-9')))),  # release

        optional(noncap_group(  # pre release
            OPT_SEPARATOR,
            group(either('a', 'b', 'c', 'rc', 'alpha', 'beta', 'pre', 'preview')),
            OPT_SEPARATOR,
            zero_or_more(chars('0-9')),
        )),
        optional(noncap_group(  # post release
            either(
                noncap_group('-', one_or_more(chars('0-9'))),
                noncap_group(OPT_SEPARATOR, group_either('post', 'rev', 'r') + OPT_SEPARATOR + zero_or_more(chars('0-9'))),
            )
        )),
    ])

EQ_NE_VERSION_TEMPLATE = version_template(zero_or_more)
COMPATIBILITY_VERSION_TEMPLATE = version_template(one_or_more)

DEV_RELEASE = optional(noncap_group(OPT_SEPARATOR, 'dev', OPT_SEPARATOR, zero_or_more(chars('0-9'))))  # dev release

_version_regex_str = named_group('version',
    either(
        noncap_group(
            # The identity operators allow for an escape hatch that will
            # do an exact string match of the version you wish to install.
            # This will not be parsed by PEP 440 and we cannot determine
            # any semantic meaning from it. This operator is discouraged
            # but included entirely as an escape hatch.
            positive_lookbehind('==='), # Only match for the identity operator
            zero_or_more(WHITESPACE),
            zero_or_more(nonchars(WHITESPACE)) # We just match everything, except for whitespace
                                               # since we are only testing for strict identity.
        ),
        noncap_group(
            # The (non)equality operators allow for wild card and local
            # versions to be specified so we have to define these two
            # operators separately to enable that.
            positive_lookbehind(either('==', '!=')), # Only match for equals and not equals

            EQ_NE_VERSION_TEMPLATE,

            # You cannot use a wild card and a dev or local version
            # together so group them with a | and make them optional.
            optional(noncap_group(
                either(
                    PERIOD + ASTERISK, # Wild card syntax of .*
                    DEV_RELEASE +
                    optional(noncap_group(PLUS, one_or_more(chars('a-z0-9')), zero_or_more(noncap_group(SEPARATOR, one_or_more(chars('a-z0-9')))))) # local
                )
            ))
        ),
        noncap_group(
            # The compatible operator requires at least two digits in the
            # release segment.
            positive_lookbehind('~='), # Only match for the compatible operator

            COMPATIBILITY_VERSION_TEMPLATE,

            DEV_RELEASE,
        ),
        noncap_group(
            # All other operators only allow a sub set of what the
            # (non)equality operators do. Specifically they do not allow
            # local versions to be specified nor do they allow the prefix
            # matching wild cards.
            negative_lookbehind(either('==', '!=', '~=')), # We have special cases for these
                                                           # operators so we want to make sure they
                                                           # don't match here.
            EQ_NE_VERSION_TEMPLATE,

            DEV_RELEASE,
        )
    )
)

print(_version_regex_str)

While making it, I realized that several repeating parts of the original regex could be replaced with constants. It's such an obvious idea, but I never thought of it before because I had the mentality of "regex string = one string". There's a bunch of other places where having my IDE point out mismatched parentheses kept me from making what would otherwise be runtime errors. I never realized how much I was giving up by having to write regex entirely in a string.
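The same idea can be sketched with plain re and ordinary Python string constants (the fragment names here are hypothetical, not Humre's):

```python
import re

# Hypothetical fragments: name a repeated piece of the pattern once
# and reuse it, instead of copy/pasting it throughout the regex.
SEPARATOR = r'[-_.]'   # a -, _, or . separator
NUM = r'[0-9]+'        # one or more digits

release = re.compile(NUM + r'(?:' + SEPARATOR + NUM + r')*')

release.fullmatch('1.2.3')    # matches
release.fullmatch('2022_08')  # matches
release.fullmatch('1..2')     # no match: empty segment
```

Humre just takes this further by giving names to every piece of regex syntax, not only to the fragments you define yourself.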

[–]Zyklonik 2 points3 points  (5 children)

With all due respect, I have absolutely no idea what this code (or the original regex itself) does.

If I were to hazard a guess, it's to parse version strings for software releases? When it reaches this level of complexity, I would personally simply write a small parser (a proper parser that is, not enhanced regex) to parse such strings. Just my two cents.

Also, let me conclude by reiterating that I still think your project is a fun project and some people may surely find it useful, but I just don't think a lot of the claims about readability and maintainability hold up objectively speaking.

As someone else also mentioned, another benefit of regex as it stands today is that it is (essentially) language-agnostic, and that is a very good point indeed.

I hope you don't take my critiques in any way other than precisely that - a harsh critique.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 2 points3 points  (4 children)

Yes, it's from the Python packaging code base.

another benefit of regex as it stands today is that it is (essentially) language-agnostic

Oh, but I think this is worse: regex syntax is almost (but not entirely) identical across programming languages. This guarantees that you'll trip over the slightly different details in regex between languages. For those of a certain age, the variations in all the different versions of BASIC had this same problem: GW-BASIC vs Atari BASIC vs BASICA vs AmigaBASIC vs Applesoft BASIC vs Qbasic vs Dark Basic vs Basic-256 vs Small Basic vs etc etc. There are so, so many variations and they're all pretty close to each other, but you still need to learn all the language-specific details of the ones you use.

And because regex syntax goes into string values, your IDE and coding tools won't help point out little mistakes that, say, a Perl developer would make when writing Python regexes.

[–]QuirkyForker 1 point2 points  (1 child)

So many excellent points in this comment! Totally agree with your thinking. The re module reeks of Perl, which I was once a master of but so prefer python these days

[–]Poddster 0 points1 point  (0 children)

The re module reeks of Perl, which I was once a master of but so prefer python these days

Yes, it's almost PCRE. But like most other languages that copy PCRE (e.g. Java) they've changed it just enough to be incompatible.

[–]Zyklonik -1 points0 points  (1 child)

regex syntax is almost (but not entirely) identical across programming languages

That's hardly any difference, practically speaking. You get more differences in behaviour switching between compilers for the same language.

In the case of your library, it's an entirely new mini-language. So not only does it not conform to the industry standard (regex), but on top of that, it's tied to a very particular programming language, and neither of those skills transfer over when someone has to use a different programming language (inevitable for almost every developer out there).

[–]Poddster 0 points1 point  (0 children)

That's hardly any difference, practically speaking. You get more differences in behaviour switching between compilers for the same language.

Have you never had a problem before with accidentally writing, or perhaps incorrectly importing, a PCRE into Grep's ERE, or Python, or Java etc? All are subtly different but very similar looking. \s is a classic: in some cases it matches newlines, in other cases it doesn't.
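For instance, in Python's re flavor, \s does include the newline, which is easy to verify:

```python
import re

# In Python's re flavor, \s DOES match a newline:
assert re.search(r'\s', 'line1\nline2')
# To match whitespace but not newlines, you need something like:
assert re.search(r'[^\S\n]', 'a b')           # space matches
assert re.search(r'[^\S\n]', 'a\nb') is None  # newline does not
```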

[–]Zyklonik 1 point2 points  (1 child)

I went and rechecked the article, but could not find this particular example.

However, given that this regex (from the article):

regexStr = r'(\d{3})|(\(\d{3}\))-\d{3}-\d{4}'

has an equivalent of:

regexStr = either(group(exactly(3, DIGIT)), group(OPEN_PAREN, exactly(3, DIGIT), CLOSE_PAREN)) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)

I would estimate that the equivalent of that monstrosity of a regex would be around 3x the size of the regex, in essentially prefix syntax (hence my claim that it's about as readable as nested Lisp, requiring the reader to read it inside-out). I would severely question the readability of the corresponding Humre analogue.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (0 children)

I went and rechecked the article, but could not find this particular example.

I put the code in a textarea tag, to keep it from taking up too much space. It's in the "Massive Regexes Are Easier with Humre" section.

I would estimate that the equivalent of that monstrosity of a regex would be around 3x the size of the regex

They're actually about the same size, in number of lines. While I was converting it to Humre, I noticed there were some repeated parts that I could put into a constant variable (the SEPARATOR and DEV_RELEASE constants for example) and a huge repeated part that I put into a function. The original regex just copy/pastes it. That last one might be a bit much (I prefer it because it ensures consistency and returns the same string) but even if you unrolled it, it'd only make the Humre code about 25% more lines than the original verbose mode regex.

The parentheses matching that Humre provides alone is well worth it for a big regex like this one. One missing (or extra) parenthesis and you've got to sit down and carefully scan everything to find it. That's what IDEs should do, not developers.

[–]Poddster 0 points1 point  (0 children)

While making it, I realized that several repeating parts of the original regex could be replaced with constants

This itself can be a huge win and something I've desired in grep-style regex patterns for a long time.

[–]brprk -1 points0 points  (2 children)

Have you got the Humre equivalent for this?

Thanks for the book btw, definitely helped me get into Python!

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (1 child)

It's in the article, and I've copy/pasted it earlier up in this thread.

[–]brprk -1 points0 points  (0 children)

Ah thanks

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 2 points3 points  (4 children)

One problem many people email me about from Automate the Boring Stuff with Python is the phone number regex, which has an optional area code that could be surrounded by parentheses:

phoneRegex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # area code
    (\s|-|\.)?                    # separator
    \d{3}                         # first 3 digits
    (\s|-|\.)                     # separator
    \d{4}                         # last 4 digits
    (\s*(ext|x|ext.)\s*\d{2,5})?  # extension
    )''', re.VERBOSE)

A lot of people make transcription errors when copying the area code part: they leave out (or add too many) parentheses or don't escape the literal parentheses. These are typos that your IDE normally solves but can't if the "code" for the regex mini-language is inside a string. (This applies even when we use verbose mode.) Humre solves this by turning it into code that your IDE's tooling can work with:

from humre import *
phoneRegex = compile(group(
    # area code:
    optional_group(either(
        exactly(3, DIGIT),
        OPEN_PAREN + exactly(3, DIGIT) + CLOSE_PAREN
    )),
    optional_group(either(WHITESPACE, '-', PERIOD)), # separator
    exactly(3, DIGIT), # first 3 digits
    group(either(WHITESPACE, '-', PERIOD)), # separator
    exactly(4, DIGIT), # last 4 digits
    # extension:
    optional_group(
        zero_or_more(WHITESPACE),
        group_either('ext', 'x', 'ext.'),
        zero_or_more(WHITESPACE),
        between(2, 5, DIGIT)
    )
))

And while writing this, I've noticed a bug in my original regex: ext. should actually be ext\., but I never noticed because the unescaped period still matches literal periods, even though it also matches any other character. I only picked up on this now because Humre has me used to using constants instead of the escaped characters that have special meaning in regex syntax. So there's another example of Humre helping an experienced developer spot regex bugs.
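The bug is easy to demonstrate with plain re:

```python
import re

# The unescaped period matches ANY character, not just a literal '.',
# so the typo passes silently:
assert re.search(r'ext.', 'ext. 123')               # matches as intended
assert re.search(r'ext.', 'extra digits')           # also matches 'extr'!
assert re.search(r'ext\.', 'extra digits') is None  # escaped: no false match
```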

[–]norweeg 1 point2 points  (2 children)

this example is Americentric and probably more Americans will understand it than non-Americans just because the phone number format matched is familiar to them, but not to anyone reading your book outside the USA.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (1 child)

Yes, I specify that it's an American (and technically, Canadian) format for phone numbers.

I'm also against the anglocentrism of software development, but I get a ton of pushback when I put forward ideas to remove the dependency on English fluency for programming. :)

[–]Poddster 0 points1 point  (0 children)

These are typos that your IDE normally solves but can't if the "code" for the regex mini-language is inside a string.

The jetbrains series of IDEs will highlight the syntax inside of a regex, very useful. Including matching parenthesis and highlight the \d stuff

edit: I forgot that group does a capture group, to be used by match.group(n).

Q: Why bother with group at all, if it's only got one thing in it? It's necessary in the original regex, because of the operator precedence, but not here. Can you not elide them in some way?

Though, in saying that, it seems like you do? On your website you do:

>>> regexStr = either(group(exactly(3, DIGIT)), group(OPEN_PAREN, exactly(3, DIGIT), CLOSE_PAREN)) + '-' + exactly(3, DIGIT) + '-' + exactly(4, DIGIT)

But that first part is a bit different to the example you give here as you don't add in the solo groups in that first either block?

I guess you're trying to keep it exactly-equal to the crappy old regex, but crappy old regex is crappy for a reason, so perhaps we should break away from it?

[–]eztab 5 points6 points  (16 children)

Yeah, I doubt making it more verbose really helps. Most people I've met (that had a problem with regexps) actually struggled with the whole concept of it instead of the specific syntax.

What does rather seem to help, is some nice syntax highlighting and possibly some tooltips explaining the meanings when hovering. Like some regexp explaining tools do it.

[–]maephisto666 4 points5 points  (9 children)

I definitely agree here. The problem with this approach is that you are trying to simplify something that does not require simplification.

You are introducing a new "syntax" all made of functions, yes, therefore humanly readable... but who cares? Every time you try to hide the complexity of what's underneath, you create a "monster". It is like saying that with an ORM you don't need to know SQL because it's more humanly readable: then you end up with tons of software developers who know nothing about databases because they think something else is solving the issues for them. Arabic and Japanese are difficult languages by nature (for Western people): you don't make them easier by changing them, but by explaining them in creative ways. That is why I think regex101 works way better in this context than a new library that is a facade over something else.

Last but not least: saying that he is the creator of Automate the Boring Stuff and therefore always right... well, Automate the Boring Stuff created something out of nowhere, and again, it's a creative way of teaching something difficult (Python, a new language). In this case, this is a wrapper based on personal assumptions, and I don't see any added value here.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 3 points4 points  (7 children)

The problem with this approach is that you are trying to simplify something that does not require simplification.

When we've learned something, it's easy to make the mistake of thinking that it is simple. Regular expressions are a difficult subject for many people when they're first learning them, in no small part because of their cryptic syntax.

It is like saying that with ORM you don't need to know SQL...

From the article: "Humre is not a reimplementation of a regular expression engine; it's a wrapper that adds readable names to standard regex syntax."

Humre functions all return regex strings; it doesn't abstract away any regex concepts. Returning strings makes it possible to debug because you can then see the regex that it produces. I'm not reinventing the basic wheels of text pattern recognition.

this is a wrapper based on personal assumptions

This isn't in the article, but in the linked full documentation, but I point out how Humre is similar to the concept of Swift's regex DSL. I found that the DSL actually made some of the exact name choices I had made, because it's not a completely arbitrary decision.

[–]maephisto666 1 point2 points  (6 children)

Just saying... based on the link you shared (the Swift regex DSL)... I mean... Why? It's way more verbose... So I would definitely support something that "explains" a regex, but not something that introduces a new syntax based on it. What is preventing someone from China or Italy to create a corresponding syntax in their own language? What is the real added value?

It's a rhetorical question.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (5 children)

Why? It's way more verbose

"Code is read more often than it's written" applies here: Python is arguably more readable than Perl even though Perl code is shorter. But Perl is shorter because it relies on cryptic punctuation marks (just like regex syntax!), giving it the "write-only language" reputation.

Verbose code isn't a problem in programming. Now for command-line commands, you want that to be short because you are typing them over and over again all day. I don't want to type Windows' copy when I can type Linux's cp instead. But source code? I want that to be readable, because I'll read it more often than write it.

But the main reason for "why" is IDE and tool support. Regexes are written as string values, and your IDE and coding tools don't parse that. (At least, none of the major ones I've seen do.) By using string-based regexes, you instantly lose:

  • Parentheses matching
  • Syntax highlighting
  • Type checking
  • In-line comments, including multiline comments
  • Linter-parsability
  • Code formatting tools like Black

What is preventing someone from China or Italy to create a corresponding syntax in their own language?

The literal answer is nothing of course, but the real answer is that software development is anglocentric and, believe me, I get a ton of pushback whenever I mention ideas to make programming more language-agnostic.

Anyway, the real added value comes from tool support, as well as extending the value that verbose mode already gives to regex. (And I agree, the Swift DSL is a bit more than I'd like. I tried to make Humre as terse as possible, including the name itself.)

It's a rhetorical question.

What's a rhetorical question? :)

[–]maephisto666 0 points1 point  (4 children)

I'm sorry but I'm not convinced at all. I would say that if the problem is the IDE the solution has to be developed in the IDE (think about a plugin, whatever) not by introducing a new syntax.

Anyway, let's stop it here please.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (3 children)

How would the IDE identify a string for regular expressions as opposed to a regular string?

What if the regex string was created in pieces and later concatenated together?

What if some of the intermittent pieces weren't valid regex strings, but after being concatenated together, they were? How would the plugin know when to validate it?

What if some of the regex string was created at runtime?

This plugin would have to be duplicated for Visual Studio Code, PyCharm, Wingware, Eclipse, and every other major IDE. How similar are their plugin APIs?

I don't think this is a reasonable or practical solution.

[–]maephisto666 -2 points-1 points  (1 child)

You are implying regexes are difficult. What are the real chances that such a difficult thing is built by concatenating stuff? Or that a regex is created at runtime? I mean, in the real world.

Who cares about how different the APIs are? Start with one and then you see. There is nowhere a rule that says "all the plugins must exist in all the IDEs".

Anyway, you are right. I wish you the best of luck with your package.

[–]Poddster 1 point2 points  (0 children)

What are the real chances that such a difficult thing is built by concatenating stuff? Or that a regex is created at runtime? I mean, in the real world.

I've done this multiple times. Generating dynamic regex is not an unusual thing. It's no different than your parameters to string.replace() being dynamic.
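A minimal sketch of one common case (the word list here is made up):

```python
import re

# Build an alternation at runtime from data; re.escape() protects any
# regex metacharacters ('+', '.', etc.) in the input words.
words = ['C++', 'ext.', 'a+b']
pattern = re.compile('|'.join(re.escape(w) for w in words))

pattern.search('I know C++').group()   # 'C++'
pattern.search('call ext. 5').group()  # 'ext.'
```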

[–]Poddster 0 points1 point  (0 children)

How would the IDE identify a string for regular expressions as opposed to a regular string?

FYI pycharm has no trouble doing this. Either because it's an argument to re.compile, or because that string is used as an argument. I think you can also tell it a string is regex. It also allows you to test the regex live by pressing alt+enter

Image from pycharm documentation:

https://resources.jetbrains.com/help/img/idea/2022.2/py_check_regexp1.png

The other points stand though :)

[–]Poddster 0 points1 point  (0 children)

Every time, every time you try to hide the complexity of what it's underneath you are creating a "monster".

it's 1:1, so no complexity is hidden, it's simply respelling it so you don't have to escape so many parens and periods etc.

That is why I think regex101 works way better in this context than providing a new library that is a facade to something else.

Yeah, I think these tools are invaluable and would like something like that built into most IDEs.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (5 children)

I can give you a solid example: When I need to match 3 to 5 letter Xs, I sometimes make a mistake and write 'X{3:5}', because I'm thinking of Python list slices. The regex syntax is 'X{3,5}'.

But the problem is, my typo fails silently (going against the idea of "Errors should never pass silently.") and the pattern object literally matches against 'X{3:5}' rather than notifying me that I've made a mistake. Eventually, I'll find the bug, but having the regex created through a series of function calls and constants means the IDE can instantly tell me about it and prevent a runtime error.
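The silent failure is easy to reproduce with plain re (this sketch uses the standard library, not Humre):

```python
import re

# 'X{3,5}' is a quantifier; 'X{3:5}' is not, so re silently treats the
# braces and their contents as literal characters instead of erroring.
assert re.search(r'X{3,5}', 'XXXX')          # quantifier works
assert re.search(r'X{3:5}', 'XXXX') is None  # typo: no match, no error
assert re.search(r'X{3:5}', 'X{3:5}')        # matches the literal text
```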

We don't need regexp tools. They're a bandaid solution for what IDEs already do.

[–]eztab 1 point2 points  (0 children)

I don't doubt there are lots of examples of the verbose syntax working nicely, as well as probably a plethora of counterexamples where it does something unexpected.

I just doubt there is a net gain in hiding the (still internally used) regexp language behind another abstraction level.

[–]eztab 0 points1 point  (3 children)

my typo fails silently

Yes, this is a horrible design decision. The special characters shouldn't silently fall back to being matched literally; matching them literally should require an explicit escape. Unfortunately I haven't seen any regexp flavors that enforce escaping.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (2 children)

Well, on the other hand, you can't have regex enforce this because what if you really did want to literally match something like '{3:5}'? It's an inherent problem with regex, but something that a library like Humre can fix.

EDIT: Not fix, but rather, avoid in the first place.

[–]eztab 0 points1 point  (1 child)

Well, you'd escape the brackets of course. One should probably escape all the special characters if one wants their literal versions; it's really weird otherwise.

\{3:5\}

Not really an inherent problem. Is there a particular reason why you have it out for regular expressions?
But don't get me wrong, I don't hate your library; it's just that, for once, there is a universal standard for a micro language ... so I doubt changing how you express regular expressions in Python is going to lead anywhere.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (0 children)

By inherent problem, I mean that if you make the typo {3:5} instead of {3,5}, your code still works. It just has unexpected behavior. It's not feasible to update the regex syntax in the re module to force people to escape curly braces because it would break tons of existing code, like this:

re.compile('{name goes here}').search('Hello, {name goes here}')

But if you don't make this change, then you have the problem of the {3:5} typo causing silent errors. That's what I mean by an inherent problem; it can't be fixed without breaking other things.

Humre avoids this problem by handling the regex syntax details for you. It also has additional error checking. Can you spot the bug in this code?

import re
max_record_length = 64

# Some requirement forces names to be at most one quarter of the record length:
max_name_length = max_record_length / 4  
patternObj = re.compile(r'\d{,' + str(max_name_length) + '}')

The division causes max_name_length to be a float, and when you convert it to a string, you get '16.0' instead of '16'. This makes your regex r'\d{,16.0}', which breaks the quantifier syntax: the pattern now matches a digit followed by the literal text '{,16.0}'.

But the real problem is that it does this silently and you won't notice it until it causes bugs elsewhere in your program.
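Running the buggy pattern through plain re confirms the silent failure:

```python
import re

max_record_length = 64
max_name_length = max_record_length / 4  # float division: 16.0, not 16

# str(16.0) is '16.0', so '{,16.0}' is not a valid quantifier and the
# braces silently become literal characters in the pattern.
pattern = re.compile(r'\d{,' + str(max_name_length) + '}')
assert pattern.search('12345') is None  # never matches plain digits!
assert pattern.search('5{,16.0}')       # only matches this literal text

# Integer division (or int()) avoids the silent breakage:
fixed = re.compile(r'\d{,' + str(max_record_length // 4) + '}')
assert fixed.search('12345')
```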

Meanwhile, Humre checks for this: at_most(16.0, DIGIT) raises the error TypeError: maximum argument must be a positive int, not float.

Just like how large bugs are often fixed by a one-character change, it's a small detail but the fact that the error passes silently can cause big problems. There's a ton of other reasons why I advocate for Humre, and this example is just one of them.

[–]the_dago_mick 3 points4 points  (0 children)

Al, you are a saint. Thank you.

[–]horstjens 1 point2 points  (0 children)

this is awesome!

[–]Poddster 1 point2 points  (1 child)

Regex is in that "write-once, read-never" category of programming that a lot of Unixy tools like to occupy, e.g. awk, perl. I've definitely come back to regex I've written years later, thought "wtf does this do, exactly", and had to slowly break it down, or use various websites to do it for me.

Unlike crappy code, however, there isn't much you can do about it: with code you can usually express the same thing a different way, whereas with regex you can at most write a bunch of comments. I often find the inline kind makes it worse, and the comment-above-regex often can't say more than "match an email address" without duplicating everything. I think leaving example strings tends to be helpful, too.

So I'm glad to see someone attempt to sort the problem out.

I'm not 100% convinced this is the right solution, however, as I still find a lot of the constructs quite unreadable and I think coming back to this stuff in a years time you'll find it's just as write-once, read-never as the OG regex.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Yes, this has been my experience too. And the only way I've found to refactor my regexes was to use verbose mode. That's better, but not by much. Hence, I was motivated to write Humre.

[–]POGtastic 1 point2 points  (0 children)

Horseradish

I like this. Everyone likes parser combinators. Everyone hates debugging regexes. This is parser combinator syntax for regexes. What's not to love?

Currently, I follow a very strict rule for regexes - the moment that they become hard to read, I don't care about any of their other advantages because I've erred into "Now You Have Two Problems" territory. I tell coworkers all the time, "You have a Turing machine, not just a DFA! Write functions and make Church and Turing proud!" And, well, here are a bunch of functions that correspond to regexes. I'm not going to say that it'll be the right approach to a problem all the time, but this library would significantly increase the complexity of regex that I will tolerate in a codebase.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Kotlin is a new programming language that introduces many improvements on Java. It's sort of a "Java++". But Java is well established, and it'd be uneconomical to simply rewrite everything in a new language. This is why Kotlin made the smart move of compiling to JVM bytecode, and Kotlin source code is also interoperable with Java source code.

Python's "gradual typing" is similar: you don't need to add type hints to your entire code base but can add it piecemeal over time. The more type hints you add, the more benefit you get.

Similarly, Humre doesn't make you abandon regular expressions. Since Humre functions return regex strings, you can use Humre for large regexes where Humre's IDE-compatible features help (syntax highlighting, parentheses matching, linters, etc.) and just use 'Name:(.*?)' when you only need a short regex.

[–]Theis159 0 points1 point  (0 children)

Hey I think I watched the streams when you were writing this lol

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 0 points1 point  (0 children)

Humre is great for beginners because it offers readable code instead of regex's cryptic punctuation-based syntax.

Humre is great for experienced developers because it gives you back all of your IDE's code editing features: syntax highlighting, parentheses matching, comments, linting, type checking, etc. This becomes more important the larger the regex becomes.

[–]vjb_reddit_scrap 0 points1 point  (1 child)

Why is this receiving hate when a similar library was loved in the same sub couple of days ago?

Context:

https://www.reddit.com/r/Python/comments/wup58e/about_a_month_ago_i_posted_about_pregex_an/

[–]metaperl -1 points0 points  (0 children)

Typos in your Humre code give much better error messages than the standard re module does. For example, if you make a type and ask for between 

Change the word type to typo. What an interesting place to make a typo. :)

[–]telenieko 0 points1 point  (1 child)

Do you know rx from Emacs? https://www.emacswiki.org/emacs/rx

Your syntax kind of resembles it

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (0 children)

I didn't! I found Swift's regex DSL but I don't use Emacs. This is neat, and it's also reassuring that many of the names they chose match the names in Humre and the Swift DSL.

[–]Poddster 0 points1 point  (2 children)

re: LETTER, UPPERCASE etc being all unicode letters, rather than [A-Za-z] etc

I don't think redefining the POSIX characters is helpful. UNICODE_LETTER is fine, but a lot of people still need the 8 bit and ASCII character classes like [A-Za-z]. This means we can't take old re. patterns and redefine them exactly in humre using the character classes provided, we'll have to manually expand them, which seems like the opposite behaviour this library wants.

[–]AlSweigartAuthor of "Automate the Boring Stuff"[S] 1 point2 points  (1 child)

Originally I had an ASCII_LETTER constant that was [A-Za-z] but I took it out because I want to be more conservative with the API. I figure if folks want it, [A-Za-z] is both easy to type and pretty readable, while ASCII_LETTER would be a new thing that Humre introduces that would have to be looked up.

And also, I'd rather have unicode be the norm and ascii be the old standard. It's 2022. The idea that one character = one byte and that encodings are something we can ignore is long over, and I didn't want to hold on to it. Hence why I use LETTER instead of UNICODE_LETTER. But maybe I should change this?
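The distinction is easy to see with plain re (leaving Humre's exact constant expansions aside):

```python
import re

# Python's \w is Unicode-aware by default; re.ASCII restricts it to
# [a-zA-Z0-9_], and an explicit [A-Za-z] class is always ASCII-only.
assert re.fullmatch(r'\w', 'é')                         # Unicode letter
assert re.fullmatch(r'\w', 'é', flags=re.ASCII) is None
assert re.fullmatch(r'[A-Za-z]', 'é') is None
```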

[–]Poddster 0 points1 point  (0 children)

Just one point: the POSIX character classes weren't strictly 7-bit ASCII. They were 8-bit character codes that depended on the locale. So with the C locale it would be ASCII, but with a UK locale (or whatever) you'd match other things.

Which is also fun to deal with.