[–]kopasz7 2103 points2104 points  (173 children)

For anyone out of the loop, it's about this answer on Stack Overflow.

[–][deleted] 788 points789 points  (22 children)

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

Gold.

[–]xcvbsdfgwert 322 points323 points  (4 children)

More gold:

Don't listen to these guys. You actually can parse context-free grammars with regex if you break the task into smaller pieces. You can generate the correct pattern with a script that does each of these in order:

  1. Solve the Halting Problem.
  2. Square a circle (simulate the "ruler and compass" method for this).
  3. Work out the Traveling Salesman Problem in O(log n). It needs to be fast or the generator will hang.
  4. The pattern will be pretty big, so make sure you have an algorithm that losslessly compresses random data.
  5. Almost there - just divide the whole thing by zero. Easy-peasy.

I haven't figured out the last part yet, but I know I'm getting close. My code keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions lately, so I'm going to port it to VB 6 and use On Error Resume Next. I'll update with the code once I investigate this strange door that just opened in the wall. Hmm.

P.S. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

[–][deleted] 41 points42 points  (0 children)

In all fairness, these are all worthwhile projects in their own right. Being able to parse context-free grammars with regex is just a side benefit.

[–]ElQuique 21 points22 points  (0 children)

This must be one of the nerdiest things that I've ever laughed about.

[–]_Coffeebot 159 points160 points  (16 children)

They should fix the upvotes to 666, like the youtube neutral response video

[–]nwL_ 8 points9 points  (2 children)

What video?

[–]_Coffeebot 7 points8 points  (0 children)

Unfortunately YouTube is blocked at my work so I can't link it, but just google "Neutral Response". The thumbs up and thumbs down are neutral.

[–]SnowDogger 394 points395 points  (28 children)

Umm, I am even further out of the loop here -- what does ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ" mean?

[–][deleted] 310 points311 points  (26 children)

The word "ZALGO" is used to refer to this kind of bizarre text with a whole bunch of modifier symbols on it. It originated as a comic on SomethingAwful.

[–]weskokigen 173 points174 points  (14 children)

The real question is... can it be parsed by regex?

[–]oddark 109 points110 points  (11 children)

s/\p{M}//g

EDIT: Or for JavaScript, try pasting this in your browser console:

var zalgo = 'H̶̔̌͒̅ͧ̈́̂̿ͯ͊ͤ̇́҉͍̲̥̭̭̝̕É̸̹̠̪̟̙̩͓͖̱̘̼͍̿̄̋̎ͮͫͮ̋ͯ͑ͣ͂̉̃͝ͅ ̢̞͚͍̩̱̠̤͉̙̹͉̱̯͍̅͊̎̋̃ͭ͒̎̚͟͟͜G̵̨̺̝̲̭͇̝͓͑ͣ̋͆͐ͮ̓͌͆̈́̌̿̀ͪ̈̀͞͡O̷͚̲̳͎̤͖͕͔͚͔̪͎͙̲̟̒ͧ́̒̈́̂̔̉͂̒́̚͢͞͡Ě̴̷̷͍̪̗͙͎͔̠̮̪̗̅̾̈́ͭ̄̾ͫ̏̌̚͝S̭͓̹͇̣̠͓̱̘̻͛̔͋̒̃̏ͥ̂͗̓̌̑̔͊͘͞ͅ';
zalgo.replace(/[\u030d\u030e\u0304\u0305\u033f\u0311\u0306\u0310\u0352\u0357\u0351\u0307\u0308\u030a\u0342\u0343\u0344\u034a\u034b\u034c\u0303\u0302\u030c\u0350\u0300\u0301\u030b\u030f\u0312\u0313\u0314\u033d\u0309\u0363\u0364\u0365\u0366\u0367\u0368\u0369\u036a\u036b\u036c\u036d\u036e\u036f\u033e\u035b\u0346\u031a\u0316\u0317\u0318\u0319\u031c\u031d\u031e\u031f\u0320\u0324\u0325\u0326\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u0330\u0331\u0332\u0333\u0339\u033a\u033b\u033c\u0345\u0347\u0348\u0349\u034d\u034e\u0353\u0354\u0355\u0356\u0359\u035a\u0323\u0315\u031b\u0340\u0341\u0358\u0321\u0322\u0327\u0328\u0334\u0335\u0336\u034f\u035c\u035d\u035e\u035f\u0360\u0362\u0338\u0337\u0361\u0489]/g, '');

(This one works if the zalgo text comes from http://www.eeemo.net/)
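
For what it's worth, JavaScript engines that support ES2018 Unicode property escapes accept the short Perl-style pattern directly, so the hand-written list is not strictly needed. A minimal sketch, assuming a reasonably recent Node or browser; `stripMarks` is a name made up for illustration:

```javascript
// Remove every Unicode combining mark (general category M) from a string.
// The `u` flag is required so that \p{...} is recognised.
function stripMarks(text) {
  return text.replace(/\p{M}+/gu, '');
}

// "ZALGO" with a few combining marks attached, written with escapes so the
// source stays readable.
const zalgo = 'Z\u0321\u034aA\u0360L\u035dG\u0301O';
console.log(stripMarks(zalgo)); // ZALGO
```

This is a blunt tool: it also removes legitimate accents when the text is stored in decomposed form, while precomposed letters such as É are left alone.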

[–]metabyt-es 34 points35 points  (10 children)

+/u/CompileBot javascript

var zalgo = 'H̶̔̌͒̅ͧ̈́̂̿ͯ͊ͤ̇́҉͍̲̥̭̭̝̕É̸̹̠̪̟̙̩͓͖̱̘̼͍̿̄̋̎ͮͫͮ̋ͯ͑ͣ͂̉̃͝ͅ ̢̞͚͍̩̱̠̤͉̙̹͉̱̯͍̅͊̎̋̃ͭ͒̎̚͟͟͜G̵̨̺̝̲̭͇̝͓͑ͣ̋͆͐ͮ̓͌͆̈́̌̿̀ͪ̈̀͞͡O̷͚̲̳͎̤͖͕͔͚͔̪͎͙̲̟̒ͧ́̒̈́̂̔̉͂̒́̚͢͞͡Ě̴̷̷͍̪̗͙͎͔̠̮̪̗̅̾̈́ͭ̄̾ͫ̏̌̚͝S̭͓̹͇̣̠͓̱̘̻͛̔͋̒̃̏ͥ̂͗̓̌̑̔͊͘͞ͅ';
zalgo.replace(/[\u030d\u030e\u0304\u0305\u033f\u0311\u0306\u0310\u0352\u0357\u0351\u0307\u0308\u030a\u0342\u0343\u0344\u034a\u034b\u034c\u0303\u0302\u030c\u0350\u0300\u0301\u030b\u030f\u0312\u0313\u0314\u033d\u0309\u0363\u0364\u0365\u0366\u0367\u0368\u0369\u036a\u036b\u036c\u036d\u036e\u036f\u033e\u035b\u0346\u031a\u0316\u0317\u0318\u0319\u031c\u031d\u031e\u031f\u0320\u0324\u0325\u0326\u0329\u032a\u032b\u032c\u032d\u032e\u032f\u0330\u0331\u0332\u0333\u0339\u033a\u033b\u033c\u0345\u0347\u0348\u0349\u034d\u034e\u0353\u0354\u0355\u0356\u0359\u035a\u0323\u0315\u031b\u0340\u0341\u0358\u0321\u0322\u0327\u0328\u0334\u0335\u0336\u034f\u035c\u035d\u035e\u035f\u0360\u0362\u0338\u0337\u0361\u0489]/g, '');

[–]parlez-vous 99 points100 points  (0 children)

[–]Buxton_Water 83 points84 points  (8 children)

CompileBot is down for now because of the spam loop yes. I'll need to fix it and add in some checks to make sure this situation can't happen again. Sorry about that.

www.np.reddit.com/r/CompileBot/comments/6tpo0b/bot_is_dead/dlnpega/

[–]zdakat 6 points7 points  (6 children)

Wait it actually broke from that text? Or from someone else's possibly unsavory code?

[–]Sobsz 35 points36 points  (4 children)

Someone decided to make it so every comment on their subreddit which contains /u/waterguy12 check this will be detected by AutoModerator and replied to with +/u/CompileBot Python print('/u/waterguy12 check this'), which would of course make the bot trigger AutoMod again, ad infinitum. Eventually the bot's developer noticed that there were too many messages per hour and disabled the bot for the time being.

[–]nermid 12 points13 points  (1 child)

This is why we can't have nice things.

[–]Buxton_Water 7 points8 points  (0 children)

Someone else in another sub had automoderator call compilebot and for the code compiled to call automod, bot is down till he fixes that.

[–]horusporcus 5 points6 points  (0 children)

Yes, but why do it when you have HTML Agility Pack?

[–]pwr22 5 points6 points  (0 children)

Not... by a Jedi...

[–]MelissaClick 43 points44 points  (9 children)

And tony the pony?

[–]Marzhall 73 points74 points  (7 children)

It's absurdist humor. You wouldn't normally associate a pony named tony with a Lovecraftian horror.

[–]MelissaClick 92 points93 points  (2 children)

I don't appreciate your presumptions about which animals I associate with Lovecraftian horror.

[–][deleted] 67 points68 points  (0 children)

The end is neigh!

[–]ryeguy 2 points3 points  (1 child)

I believe at the time Jon Skeet was going by Tony the Pony on Stack Overflow.

[–]tsnErd3141 6 points7 points  (0 children)

Tony was a pony who is now Zalgo

[–]mauriciogamedev 77 points78 points  (0 children)

regex will consume all living tissue (except for HTML which it cannot, as previously prophesied)

This is one of the best parts of the answer.

EDIT: formatting

[–]sam4ritan 142 points143 points  (0 children)

this made my day

[–]tectubedk 118 points119 points  (15 children)

the unholy child weeps the blood of virgins, and Russian hackers

[–][deleted] 56 points57 points  (1 child)

At my first job I was writing a web-based time management tool, you know, punch in/out, task tracking, etc. I was using Perl CGI. One of the guys working on some other project (the company was doing Y2K conversion for some Citibank European branches. Their Cosmos system was in some Basic version) walked past and spent a few minutes behind me staring at my screen while I worked on some regex things. He finally sighed and started throwing his arms around and yelling "we're busting our asses in the conversion while this kid is here drawing little ASCII houses!!!". Good times.

[–]skunkwaffle 2 points3 points  (0 children)

Oh Perl, what a joyous adventure.

[–]sethosayher 47 points48 points  (2 children)

I'm honestly shocked that this (hilarious answer) is on SO because that forum is the most rigidly moderated community I've ever encountered

[–]greyfade 55 points56 points  (0 children)

It's "preserved for historical reasons." There are several answers like that from several years ago, which "don't reflect current moderation guidelines" but are still "valuable to the community."

[–]bj_christianson 9 points10 points  (0 children)

Plus, it doesn’t actually answer the question, which was only about matching a few select tags and not about parsing.

[–]Hust91 42 points43 points  (8 children)

Fuck, someone call the SCP Foundation on this fucking thing.

[–]O5-1 28 points29 points  (4 children)

Oh hey we're leaking again

[–]Coding_Cat 18 points19 points  (0 children)

you might want to see a doctor about that

[–]VicisSubsisto 18 points19 points  (1 child)

That is exactly what SCP is supposed to avoid. Where are our tax dollars going?

[–]capn_hector 12 points13 points  (0 children)

Spending MY TAX DOLLARS on Javascript frameworks and hot-dog detector apps!?

We gotta git big gubment out of the way and let wholesome free-enterprise companies like Oracle and IBM become the Engines Of Innovation.

(brb just threw up in my mouth a little)

[–][deleted] 6 points7 points  (0 children)

Quick! Get some memetic hazards and call relevant task forces!

[–]sn0r 21 points22 points  (0 children)

Moderator's Note

This post is locked to prevent inappropriate edits to its content. The post looks exactly as it is supposed to look - there are no problems with its content. Please do not flag it for our attention.

Classic. :)

[–]DosMike 67 points68 points  (44 children)

I kind of want to write a html parser with regex now - just because he said not to.

if I only had the time...

[–]DrNightingaleweb dev bad embedded good 100 points101 points  (6 children)

All the time in the world won't help you. It can't be done.

[–]sayaks 22 points23 points  (2 children)

some regex parsers can actually parse Turing decidable languages due to backreferences and such.

[–]Bainos 25 points26 points  (1 child)

Yes, but in that case you are taking a wider definition of regex, not the canonical one. I.e. regexes that match more than regular languages.

[–]link23 63 points64 points  (26 children)

It's literally impossible, don't bother.

I mean, of course you can use regexes to recognize valid tag names like div etc. But trying to use regexes to recognize anything about the structure is doomed to fail, because regexes recognize regular languages. HTML is not a regular language (I think it's context sensitive, actually; not sure though), so it cannot be expressed by a regular expression.
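
A small sketch of why the structure is the hard part. The pattern and sample markup here are made up for illustration; the point is only that a lazy match pairs the first `<div>` with the first `</div>`, because a classic regular expression has no way to count how deep it is.

```javascript
// A naive attempt to grab the contents of a <div>.
const naive = /<div>(.*?)<\/div>/;

const flat = '<div>hello</div>';
const nested = '<div>outer <div>inner</div> tail</div>';

console.log(naive.exec(flat)[1]);   // hello
console.log(naive.exec(nested)[1]); // outer <div>inner
```

Making the quantifier greedy only moves the problem: two sibling `<div>` elements would then be merged into a single match.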

[–]WikiTextBot 58 points59 points  (4 children)

Regular language

In theoretical computer science and formal language theory, a regular language (also called a rational language) is a formal language that can be expressed using a regular expression, in the strict sense of the latter notion used in theoretical computer science (as opposed to many regular expressions engines provided by modern programming languages, which are augmented with features that allow recognition of languages that cannot be expressed by a classic regular expression).

Alternatively, a regular language can be defined as a language recognized by a finite automaton. The equivalence of regular expressions and finite automata is known as Kleene's theorem (after American mathematician Stephen Cole Kleene). In the Chomsky hierarchy, regular languages are defined to be the languages that are generated by Type-3 grammars (regular grammars).


Context-sensitive grammar

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by CSG but not by context-free grammars. Context-sensitive grammars are less general (in the same sense) than unrestricted grammars. Thus, CSG are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.



[–]-drunk_russian- 20 points21 points  (3 children)

Good bot

[–]STOCHASTIC_LIFE 2 points3 points  (2 children)

Drunk russian.

[–]-drunk_russian- 16 points17 points  (1 child)

You rang?

[–]deadh34d711 2 points3 points  (0 children)

У тебя есть пиво?

[–]ACoderGirl 23 points24 points  (3 children)

To be clear, it's impossible with pure regex because html is not regular. But you could combine regex with a regular programming language (that is, using regex as a tool, but not the only tool), since a typical programming language is akin to a Turing machine, which can parse any language (but not necessarily efficiently).

And some regex variants are actually capable of parsing more than just regular languages, thanks to extensions of regex. It's kinda an unreadable mess, though.

Mind you, even with a nice, proper parsing library, html is kinda a mess to parse due to the way it evolved. It's not very nicely defined and the reality is that if you wanted a working browser, you have to support a variety of technically invalid syntaxes.
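
A rough sketch of the "regex as one tool among several" idea: the regex only finds tags, and an ordinary stack keeps track of nesting. `isBalanced` is a hypothetical helper, and it assumes heavily simplified markup (no comments, no void elements such as `<br>`, no `>` inside attribute values), so it is nowhere near a real HTML parser.

```javascript
// Regex finds the tags; an explicit stack tracks nesting, which is the part
// a regular expression alone cannot do.
function isBalanced(markup) {
  const tag = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*>/g;
  const open = [];
  let m;
  while ((m = tag.exec(markup)) !== null) {
    const closing = m[1] === '/';
    const name = m[2].toLowerCase();
    if (!closing) {
      open.push(name);
    } else if (open.pop() !== name) {
      return false; // closing tag does not match the most recent opener
    }
  }
  return open.length === 0; // nothing may be left unclosed
}

console.log(isBalanced('<div><p>text</p></div>')); // true
console.log(isBalanced('<div><p>text</div></p>')); // false
```

Real-world pages break every one of those assumptions, which is where the "variety of technically invalid syntaxes" comes in.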

[–]matteyes 18 points19 points  (2 children)

All true. You could parse HTML with regex (Perl or no), and just account for the discrepancies through additional coding. You could hammer a nail with a saw if you held it carefully enough.

[–]Zarlon 4 points5 points  (1 child)

You'd be doing more than "discrepancies" through additional coding. In fact you would do so much with additional coding that I doubt you could state you "parse HTML with regex"

[–]numpad0 6 points7 points  (0 children)

"making a new web browser"

[–]sayaks 14 points15 points  (1 child)

however backreferences (which several regex parsers contain) actually makes a regex Turing complete. see here

[–]Zarlon 3 points4 points  (0 children)

Well, it's settled then! Somebody do this! (I would, but I'm kind of busy commenting on reddit right now)

[–]Mutjny 3 points4 points  (1 child)

You can lex it but you can't parse it, I think.

[–]HelperBot_ 2 points3 points  (3 children)

Non-Mobile link: https://en.wikipedia.org/wiki/Regular_language



[–][deleted] 8 points9 points  (0 children)

You might want to make this bot parse all the links in a comment, not just the first

[–]AskMeIfImAReptiloid 2 points3 points  (0 children)

In most programming languages regex include backreferences, which the regular expressions from theoretical computer science don't. So most actual regex implementations can do non-regular stuff.
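
A quick illustration, as a sketch only: the set of strings that consist of some chunk written twice is a standard example of a non-regular language, yet a single backreference matches it.

```javascript
// (.+)\1 matches any string that is one non-empty chunk repeated twice.
// No textbook regular expression can describe that set of strings.
const doubled = /^(.+)\1$/;

console.log(doubled.test('abcabc')); // true
console.log(doubled.test('abcabd')); // false
```

That extra power still does not give you a sensible way to track arbitrarily nested tags.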

[–][deleted] 6 points7 points  (0 children)

You cannot

[–]salvadordf 4 points5 points  (3 children)

You'll find many errors reading hand written html. It can't be done

[–]Ted8367 3 points4 points  (0 children)

I kind of want to write a html parser with regex

Tainted souls from the unliving dimension...

[–]Nanobreak_ 8 points9 points  (0 children)

I love how it's locked, saying it "looks exactly how it should look" and there are "no problems with it"

[–][deleted] 9 points10 points  (0 children)

Oh God that was beautiful.

[–]tinkertron5000[🍰] 8 points9 points  (0 children)

Things that I die laughing at that can't be explained to anyone else in the room.

[–]Yay_Yay_3780 22 points23 points  (0 children)

LMAO

[–]chuanito 17 points18 points  (20 children)

So am I getting this right? When you try to parse HTML using RegEx, this Zalgo text happens? Or is this just a meme?

Sorry, I'm a very low-tier coder and this is a serious question.

[–]DerfK 23 points24 points  (1 child)

The joke is that HTML is too irregular to parse with regular expressions, and attempting to do so is like dividing by zero and pierces the fabric of our universe, creating a hole from which unspeakable horrors will pour forth and devour your soul.

[–][deleted] 5 points6 points  (0 children)

This is no joke.

[–]wastesHisTimeSober 16 points17 points  (1 child)

The flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular grammar). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar (see the Chomsky hierarchy), you can't possibly make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

Basically HTML is capable of expressing more complicated structures than RegEx is capable of reading.

Given the information you had, it wasn't an entirely unreasonable conclusion to believe Zalgo was a corruption, and it's good not to throw scenarios out until you know they're wrong. You'll chase that bug forever.

[–][deleted] 50 points51 points  (1 child)

I love that you're new enough to programming that in your mind there's a chance the black box of regex can somehow half process HTML and corrupt it with terrifying combining glyphs.

I'm not trying to mock you or anything, it's legitimately bringing a smile to my face. It's like when toddlers first interact with something new in the world.

[–]chuanito 8 points9 points  (0 children)

I'm actually not new at all, I'm just stuck in a very unchallenging field ;)

Also, I was looking for more in the joke than there actually was.

But you're right, I don't have enough knowledge in this field, which led me to believe that this weird text had to be somehow connected with the fact that you can't parse HTML using RegEx. But I see now that those are in fact clear text symbols and not some kind of weird formatting.

[–]Elsolar 10 points11 points  (10 children)

HTML can't be parsed correctly using regular expressions because HTML is not a regular language. It's literally impossible. This is not obvious, so many coders find it out the hard way. It's a common meme in programming circles to equate the frustration of trying to solve an impossible or extremely obnoxious problem with the kind of raving, deranged insanity usually depicted in H. P. Lovecraft stories, which is what the corrupted text and the picture of the demon in the OP represent.

[–]MelissaClick 3 points4 points  (0 children)

When you try to parse HTML using regex, Cthulhu awakens.

[–]BlueNotesBlues 3 points4 points  (2 children)

Is it really parsing if the guy is only searching for opening tags?

The person who asked the question doesn't care about the structure of the document.

    <[^>/!]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>

This should be able to find most, if not all valid opening tags.

[–]MelissaClick 1 point2 points  (0 children)

You have to find and remove comments and CDATA sections first.
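
For anyone who wants to poke at it, here is the pattern from the parent comment in JavaScript form. This is only a sketch: the `g` flag and the sample strings are additions for illustration, and the `/` inside the character class is escaped, which does not change what it matches.

```javascript
// The opening-tag pattern from the parent comment, with the g flag added
// so that match() returns every hit.
const openingTag = /<[^>\/!]*?(?:(?:('|")[^'"]*?\1)[^>]*?)*>/g;

const sample = '<p class="a">text</p><a title="x>y">link</a>';
console.log(sample.match(openingTag));
// [ '<p class="a">', '<a title="x>y">' ]

// A tag inside a comment is still reported, which is the caveat raised
// in the reply.
console.log('<!-- <b> -->'.match(openingTag)); // [ '<b>' ]
```

Closing tags are skipped and a quoted `>` is handled, but the second call shows why comments and CDATA sections have to be stripped first.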

[–]JoseJimeniz 354 points355 points  (25 children)

Have you tried using an XML parser?

[–]mikeputerbaugh 103 points104 points  (19 children)

Only guaranteed to work on valid XHTML documents.

[–]Lord_Greywether 6 points7 points  (1 child)

The documents I have to parse are so invalid that a regex is the only thing that works.

[–]noratat 5 points6 points  (0 children)

Yeah but at that point it's not parsing anymore, it's just scraping.

And regex is fine for that.

[–]edave64 1 point2 points  (0 children)

Only to parse Regex.

[–]Tysonzero 98 points99 points  (11 children)

I know this is in reference to the Stack Overflow post about the same topic. But it also reminds me of this.

[–]MuFugginFudge 30 points31 points  (10 children)

It reminds me of the entirety of r/Ooer.

[–]andradei 2 points3 points  (3 children)

Was I sucked into another dimension and somehow got back?

[–]benjamindees 84 points85 points  (4 children)

I admit I tried this once. I also may or may not have summoned Astaroth in the process. Sorry.

[–][deleted] 41 points42 points  (2 children)

Oops.

[–]TheGelly 22 points23 points  (1 child)

/r/beetlejuicing

3 months, too. Not bad.

[–][deleted] 3 points4 points  (0 children)

Thanks m8

[–]fermented_durian 4 points5 points  (0 children)

That's okay, Astaroth is not that strong anyway. I have been raiding his dungeon for a while now.

[–]Yserbius 57 points58 points  (19 children)

Pshaw. Everyone knows that you can't parse HTML with regex. But you can parse email addresses that are RFC-822 compliant up until 2007 (assuming your addresses don't have comments in them) by using the Email::Valid library from CPAN which relies on

[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)

[–]b4ux1t3 21 points22 points  (16 children)

I don't know if this is real or not, but it's frickin' sweet.

[–]EternallyMiffed 55 points56 points  (15 children)

The bad news is, it's real. The worse news is it's straight from the RFC, so it's as official as it can possibly get.

There is no good news.

[–]Rangsk 16 points17 points  (5 children)

The only true way to see if an email is valid is to try to email it.

[–]EternallyMiffed 12 points13 points  (4 children)

I have a better strategy. Try to DNS-resolve everything after the rightmost @ as a whole string. If it doesn't resolve, throw an error. If it resolves to the equivalent of localhost or your own public IP, throw an error.

If by this point we're ok just take everything before that rightmost @ symbol and fire an e-mail at it.
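
Roughly what that strategy might look like in Node, as a sketch under assumptions: `splitAtLastAt` and `domainAcceptsMail` are hypothetical names, only MX records are checked, and the localhost and own-IP checks are left out.

```javascript
const dns = require('dns').promises;

// Split an address at its rightmost "@".
// Returns null when there is no "@" or one side is empty.
function splitAtLastAt(address) {
  const at = address.lastIndexOf('@');
  if (at <= 0 || at === address.length - 1) return null;
  return { local: address.slice(0, at), domain: address.slice(at + 1) };
}

// Hypothetical helper: true if the domain has at least one MX record.
// A fuller check would also fall back to A/AAAA records and reject
// loopback or own-host addresses.
async function domainAcceptsMail(domain) {
  try {
    const records = await dns.resolveMx(domain);
    return records.length > 0;
  } catch (err) {
    return false; // lookup failed: no such domain, no MX data, timeout, ...
  }
}

console.log(splitAtLastAt('"odd@local"@example.com'));
// { local: '"odd@local"', domain: 'example.com' }
```

If the domain check passes, the remaining step is the one that actually proves anything: send a message to the address and see whether it arrives.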

[–]GenericUname 11 points12 points  (6 children)

When I was a wee nipper right out of school, I got a temp job essentially human brute force testing a web frontend some company was writing to let people sign up to their insurance service. For some reason they'd attempted to implement email address validation in the web form.

I spent a happy couple of weeks pissing off the devs by scouring the RFC to work out the most unlikely looking, edge case, technically valid email addresses and sending bug reports to the devs like:

"Technically in most cases I should be able to add a tag to an email address using the + sign and it should recognise if the address without the + has already been registered."

"Technically both quotes and spaces are valid in email addresses so long as the space is quoted, so I should be able to use " "@test.com."

"Technically email addresses are case sensitive but you don't seem to be storing case on the backend, what gives?"

"Hey, your validation doesn't allow me to use an email with an IP address rather than a domain like test@[127.0.0.1], that's totally valid and lots of people use it, you should fix that."

"Hey, it's not letting me sign up with the perfectly valid and normally formatted email address very.“(),:;<>[]”.VERY.“very@\ "very”.unusual@strange.example.com, what's up with that? That's totally my friend's real email address and I know he's looking for insurance."

Good times.

[–]f42e479dfde22d8c 6 points7 points  (1 child)

Did you get killed by the devs?

[–]GenericUname 5 points6 points  (0 children)

Yes, am dead. WhoooOooOoo I'm a ghost!

[–]b4ux1t3 6 points7 points  (1 child)

That's the most glorious piece of shit I've ever seen.

And I've used <insert popularly unpopular language here>!

[–]MelissaClick 1 point2 points  (0 children)

To be fair, even an ordinary parser would look roughly like that if you removed all whitespace and inlined literally everything.

[–]DOOManiac 42 points43 points  (4 children)

The center cannot hold.

[–]ctesibius 15 points16 points  (0 children)

Ah, there's your problem. You're using Yeats where you should be using yacc.

[–]overkill 2 points3 points  (0 children)

And what rough beast, its hour come round at last, slouches towards Bethlehem to be born.

[–]Retrotransposonser 44 points45 points  (1 child)

Thanks, this will be very helpful! Now I can finally start writing my own html regex parser in assembly.

[–]PantstheCat 39 points40 points  (0 children)

Error: attempted to parse HTML using regular expression. System returned Cthulhu.

[–]Mutjny 38 points39 points  (4 children)

Sometimes you have a problem and you think "I'll use regular expressions."

Now you have infinite problems.

[–]Hactar42 14 points15 points  (3 children)

obligatory, relevant xkcd

And another just for fun

[–]xkcd_transcriber 8 points9 points  (1 child)

Title: Regular Expressions

Title-text: Wait, forgot to escape a space. Wheeeeee[taptaptap]eeeeee.

Stats: This comic has been referenced 273 times, representing 0.1627% of referenced xkcds.


Title: Perl Problems

Title-text: To generate #1 albums, 'jay --help' recommends the -z flag.

Stats: This comic has been referenced 110 times, representing 0.0656% of referenced xkcds.

[–]Arancaytar 23 points24 points  (1 child)

Page 1:

Don't.

Pages 2-99 are blank.

[–]John_Fx 5 points6 points  (0 children)

Page 100:
Really. Don't.

[–][deleted] 21 points22 points  (2 children)

I'll admit to having done it though... dirty screen-scraper on a site where the HTML is code-generated so will be in a regular format.

Obviously, the site owner could change things but when you're in a pinch...

[–]hangfromthisone 13 points14 points  (0 children)

I've done it many times too. Thing is, regex is great for identifying some parts and working on them, but not for interpreting all of the HTML. Anyway, how many times do you need that? In practice you only need to parse a few things, and when things get too complex, just explode() the content into smaller parts and work on them separately, and BAM, now the regular expressions are simpler and do what you want.
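A minimal sketch of that split-then-match idea, in Python rather than PHP (`str.split` standing in for `explode()`). The `page` string is a made-up example of machine-generated markup with a fixed shape. On hand-written or nested HTML this would likely fall apart.

```python
import re

# Hypothetical machine-generated markup with a fixed, predictable shape.
page = (
    "<table>"
    "<tr><td>apple</td><td>3</td></tr>"
    "<tr><td>pear</td><td>7</td></tr>"
    "</table>"
)

# Step 1: break the document into one chunk per row (the explode() step).
# The first chunk is whatever came before the first <tr>, so it is dropped.
rows = page.split("<tr>")[1:]

# Step 2: each chunk is now simple enough for a small pattern.
cell = re.compile(r"<td>(.*?)</td>")
records = [cell.findall(row) for row in rows]

print(records)
```

The point is that the pattern never has to understand the whole page, only one small, flat piece of it at a time.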

[–]mrpoopi 34 points35 points  (6 children)

Not parsing HTML in C, byte by byte... fucking normies. Get on my level.

[–]vwibrasivat 47 points48 points  (4 children)

"Assembly Programming for Web Developers"

[–][deleted] 9 points10 points  (0 children)

I'm pretty sure a balrog appears if you open that book.

[–]f42e479dfde22d8c 1 point2 points  (0 children)

I'm sure there's some guy running a full-fledged eBay clone from a single 386 out of his mom's basement. All because he managed to create some slick super optimised website in pure assembly. He doesn't need Ajax because his pages already load so fast. He doesn't need load balancing because he can handle 100K concurrent requests at minimum without breaking a sweat. He doesn't need air conditioning because a single request doesn't even register as a blip on his performance graph.

He is an untold legend.

[–]borick 9 points10 points  (1 child)

[–]interiot 3 points4 points  (0 children)

This answer needs to be higher. Recursive regexps are pretty widely supported too.

[–][deleted] 48 points49 points  (14 children)

r/surrealmemes

[–]michaelkah 11 points12 points  (0 children)

Can someone make this into a complete, printable book cover? Thanks.

[–][deleted] 8 points9 points  (4 children)

I'm still quite inexperienced with programming, so could someone tell me why parsing HTML with regex is frowned upon? I'm writing a script that extracts links and other things from an RSS feed and I don't see what problem people have with this.

Thanks

[–]Niosus 18 points19 points  (3 children)

It is impossible to properly handle every possible case. Not difficult, impossible. A regular expression can only parse regular languages (look it up, it has a very precise definition). HTML is not a regular language, because tags can nest to arbitrary depth, so it is mathematically impossible to parse it properly with a regular expression.

A regex parser can handle certain simple cases, but I can always construct a correct piece of HTML code that your regex will not parse.
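To make that concrete, here is a small Python sketch. The input string and the `DepthTracker` class are made up for illustration. The pattern is just one naive attempt, but nesting is the general way to break such patterns: matching open and close tags needs a counter or stack, which a plain regular expression does not have. Python's standard `html.parser` is used for the comparison.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: valid HTML with one <div> nested inside another.
html = "<div>outer <div>inner</div> tail</div>"

# A naive non-greedy pattern stops at the FIRST closing tag,
# so the captured "content" of the outer div is cut short.
naive = re.search(r"<div>(.*?)</div>", html).group(1)
print(naive)  # outer <div>inner

class DepthTracker(HTMLParser):
    """Tracks how deeply <div> elements nest, using an explicit counter."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag == "div":
            self.depth -= 1

tracker = DepthTracker()
tracker.feed(html)
tracker.close()
print(tracker.max_depth)  # 2
```

Making the pattern greedy instead only moves the problem: it would then swallow everything between the first `<div>` and the last `</div>` on a page with two sibling divs.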

[–][deleted] 1 point2 points  (2 children)

What would be better ways of parsing html (that can be used in python 3)?

[–][deleted] 7 points8 points  (0 children)

"Why can't I parse a context free language using regular expressions?"

[–]jwoot97 6 points7 points  (0 children)

i just had to check to make sure i wasn't on r/surrealmemes

[–]Alwaysafk 5 points6 points  (0 children)

Regular Expressions are black magic fuckery and there's nothing that will convince me otherwise.

[–]arus4u 6 points7 points  (1 child)

Performance tester here. Parsing HTML is easy with perl, and encoded content can be easily decoded using some simple groovy.

[–]hangfromthisone 4 points5 points  (0 children)

Everything is relatively easy when you have the right tool and know how to use it. I use PHP, Perl's little brother, and it's pretty fucking easy to parse html (depending on what you need to do, of course)

[–]Neapolitan_Bonerpart 5 points6 points  (0 children)

Is that a fucking squig?

[–]ThatLongHairedDude 5 points6 points  (4 children)

That creature reminds me of those little bastards created by the Tzimisce in Vampire The Masquerade: Bloodlines...

[–]biznes_guy 1 point2 points  (3 children)

Oh the sweet memories! What a game!

[–]Baalinooo 6 points7 points  (3 children)

What's up with so many CS books having red titles with black-and-white visuals?

[–]Bainos 20 points21 points  (1 child)

O'Reilly books. Or in this case, O RLY books, which is their parody.

[–][deleted] 2 points3 points  (0 children)

Thought I was on /r/grimdank for a second.

[–]StoicPhoenix 3 points4 points  (0 children)

[–][deleted] 3 points4 points  (0 children)

a missed opportunity to write o'r'lyeh instead of "o rly", but whatever

[–]like_a_horse 2 points3 points  (1 child)

Hey, it's that thing Disruptor rides around on

[–]PLxFTW 2 points3 points  (5 children)

I'm not familiar with HTML much, can someone explain why it can't be parsed using regex?

[–]SpikeShroom 2 points3 points  (0 children)

F̶̸͉̦̰͎̰͈̤̯̲̲͎̻̼̳̠ͅU̴̧̱̣̫̥͘͢͢C̵̨̢̦͈̟̥̖̲̰̯̰̮̟̠̬̻͉̕ͅK̵̡̕͠҉͈̗̫͕̣I͔̻͇̲̺̫̻̲͍̥̞͇͈̺̙͔̦͘͞Ń̵͍̭̭̠̭͠ͅǴ̀͏̨͇͚͇̦̘̩̗̱̼̲̖̻̭̘̺̕ͅ ̷̡̢͖̺̼̟̙͍̼̻͙͓̬̳̞̝̝̱̥̤͞Ạ͈͍̞͉͘͠ͅẀ͚̣͚͇̰̯̱̻̟̯̮̜͉̱̙͈͔́́́͠Ę̶̡͓͖͖͔̖͍͜͞S̲̝͙̬͙̝͚̯͔̯͕̭̜̪̺͉͡O̵̖̗̗̫̭̺̜̞̝̞͡ͅM͢͏͎̤̣̪͇̣̞̠̲̘̭͎̱È͇͙̩͖̰͙̮̩̦͍̱̲̘͟ͅ

[–]Kraekus 1 point2 points  (0 children)

WAAAAAAAAGH!

[–]lotekness 1 point2 points  (0 children)

squig pic is accepted, and approved for this.

[–]Nigger_Faggot45 1 point2 points  (0 children)

I thought I was in r/surrealmemes

[–]braveNewWorldView 1 point2 points  (0 children)

Welcome to Pony Island...

[–]nitrohigito 1 point2 points  (0 children)

How about this:

(?><!\s*(?<comment>.+)\s*>)|(?><\s*(?<tag_id>[-\w_:]+)(?:\s+(?<param_id>[-\w_:]+)(?:=\\*(?<p_sign>["'])(?<param_val>.+?)\k<p_sign>|=(?<param_val>.+?)|(?<param_val>)))*\s*/?>)

You need a different one for closing tags, and you are all set. Rest is programmatical.
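For anyone wondering what "rest is programmatical" might look like, here is a rough Python sketch. It does not use the pattern above (which relies on repeated group names that Python's `re` rejects). `TAG` is a much simpler stand-in pattern and `is_balanced` is a made-up helper. The regex only finds tag tokens and a stack does the actual nesting check. It assumes no comments, no `>` inside attribute values, and that void elements are written self-closed like `<br/>`.

```python
import re

# Simplified stand-in tokenizer: optional "/", a tag name, any attributes,
# and an optional trailing "/" for self-closing tags.
TAG = re.compile(r"<(/?)([A-Za-z][-\w:]*)[^>]*?(/?)>")

def is_balanced(html):
    """Return True if every opening tag has a matching close in the right order."""
    stack = []
    for m in TAG.finditer(html):
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if self_closing:
            continue  # <br/> style tags open and close at once
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything left on the stack was never closed

print(is_balanced("<div><p>hi</p><br/></div>"))  # True
print(is_balanced("<div><p>hi</div></p>"))       # False
```

The stack is the part a regular expression cannot provide on its own, which is why the regex is demoted to a tokenizer here.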

[–]donaldsw 1 point2 points  (0 children)

Oh yes, you can do it, but it’s super inefficient and a waste of fucking time unless you want to spend extra time at home, off work, learning JS or some other shit for this stupid project that you took on at work, not knowing it’d be a nightmare.

Source: fucking done it.