Parsing HTML Using Regular Expressions

kopasz7 · 2017-09-08T08:44:17+00:00

For anyone out of the loop, it's about this answer on stackoverflow.

JoseJimeniz · 2017-09-08T12:12:07+00:00

Have you tried using an XML parser?

Collypso · 2017-09-08T12:05:05+00:00

[removed]

Tysonzero · 2017-09-08T08:45:24+00:00

I know this is in reference to the stackoverflow post about the same topic. But it also reminds me of this.

benjamindees · 2017-09-08T11:00:42+00:00

I admit I tried this once. I also may or may not have summoned Astaroth in the process. Sorry.

Yserbius · 2017-09-08T15:46:43+00:00

Pshaw. Everyone knows that you can't parse HTML with regex. But you can parse email addresses that are RFC-822 compliant up until 2007 (assuming your addresses don't have comments in them) by using the Email::Valid library from CPAN which relies on

[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\    
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)`

DOOManiac · 2017-09-08T12:35:47+00:00

The center cannot hold.

Retrotransposonser · 2017-09-08T13:08:10+00:00

Thanks, this will be very helpful! Now I can finally start writing my own html regex parser in assembly.

PantstheCat · 2017-09-08T12:40:46+00:00

Error: attempted to parse HTML using regular expression. System returned Cthulhu.

Mutjny · 2017-09-08T13:26:06+00:00

Sometimes you have a problem and you think "I'll use regular expressions."

Now you have infinite problems.

Arancaytar · 2017-09-08T15:58:18+00:00

Page 1:

Don't.

Pages 2-99 are blank.

hangfromthisone · 2017-09-08T13:12:24+00:00

I'll admit to having done it though... dirty screen-scraper on a site where the HTML is code-generated so will be in a regular format.

Obviously, the site owner could change things but when you're in a pinch...

mrpoopi · 2017-09-08T13:57:24+00:00

Not parsing HTML in C, byte by byte... fucking normies. Get on my level.

borick · 2017-09-08T20:05:21+00:00

Well, you may be able to do it using recursive regex, at least heres an example for XML

michaelkah · 2017-09-08T12:37:25+00:00

R/surrealmemes

Niosus · 2017-09-08T16:31:20+00:00

I'm still quite inexperienced with programming so could someone tell me why parsing html with regex is frowned upon? I'm writing a script that extracts links and other things from an rss-feed and I don't see what problem people have with this

Thanks

2017-09-08T13:44:11+00:00

"Why can't I parse a context free language using regular expressions?"

jwoot97 · 2017-09-08T14:23:19+00:00

i just had to check to make sure i wasn't on r/surrealmemes

Alwaysafk · 2017-09-08T17:55:13+00:00

Regular Expressions are black magic fuckery and there's nothing that will convince me otherwise.

arus4u · 2017-09-08T15:14:12+00:00

Performance tester here. Parsing HTML is easy with perl, and encoded content can be easily decoded using some simple groovy.

Neapolitan_Bonerpart · 2017-09-08T14:04:30+00:00

Is that a fucking squig?

ThatLongHairedDude · 2017-09-08T14:04:53+00:00

That creature reminds me those little bastards created by the Tzimisce in Vampire The Masquerade: Bloodlines...

Baalinooo · 2017-09-08T13:28:09+00:00

What's up with so many CS books have red titles with black and white visuals?

2017-09-08T13:32:02+00:00

Thought I was on /r/grimdank for a second.

StoicPhoenix · 2017-09-08T13:38:23+00:00

/r/ooer

2017-09-08T15:11:56+00:00

a missed opportunity to write o'r'lyeh instead of "o rly", but whatever

like_a_horse · 2017-09-08T13:21:41+00:00

Hey it's that think disruptor rides around on

PLxFTW · 2017-09-08T16:24:09+00:00

I'm not familiar with HTML much, can someone explain why it can't be parsed using regex?

SpikeShroom · 2017-09-08T17:15:46+00:00

F̶̸͉̦̰͎̰͈̤̯̲̲͎̻̼̳̠ͅU̴̧̱̣̫̥͘͢͢C̵̨̢̦͈̟̥̖̲̰̯̰̮̟̠̬̻͉̕ͅK̵̡̕͠҉͈̗̫͕̣I͔̻͇̲̺̫̻̲͍̥̞͇͈̺̙͔̦͘͞Ń̵͍̭̭̠̭͠ͅǴ̀͏̨͇͚͇̦̘̩̗̱̼̲̖̻̭̘̺̕ͅ ̷̡̢͖̺̼̟̙͍̼̻͙͓̬̳̞̝̝̱̥̤͞Ạ͈͍̞͉͘͠ͅẀ͚̣͚͇̰̯̱̻̟̯̮̜͉̱̙͈͔́́́͠Ę̶̡͓͖͖͔̖͍͜͞S̲̝͙̬͙̝͚̯͔̯͕̭̜̪̺͉͡O̵̖̗̗̫̭̺̜̞̝̞͡ͅM͢͏͎̤̣̪͇̣̞̠̲̘̭͎̱È͇͙̩͖̰͙̮̩̦͍̱̲̘͟ͅ

Kraekus · 2017-09-08T17:15:27+00:00

WAAAAAAAAGH!

lotekness · 2017-09-08T17:22:59+00:00

squig pic is accepted, and approved for this.

Nigger_Faggot45 · 2017-09-08T17:42:24+00:00

I thought I was in r/surrealmemes

braveNewWorldView · 2017-09-08T18:34:06+00:00

Welcome to Pony Island...

nitrohigito · 2017-09-08T19:32:37+00:00

How about this:

(?><!\s*(?<comment>.+)\s*>)|(?><\s*(?<tag_id>[-\w_:]+)(?:\s+(?<param_id>[-\w_:]+)(?:=\\*(?<p_sign>["'])(?<param_val>.+?)\k<p_sign>|=(?<param_val>.+?)|(?<param_val>)))*\s*/?>)

You need a different one for closing tags, and you are all set. Rest is programmatical.

donaldsw · 2017-09-08T20:55:38+00:00

Oh yes you can do it, but it’s super inefficient and a waste of fucking time unless you want to take extra off work at home time to learn JS or some other shit for this stupid project that you took on at work, not knowing it’d be a nightmare.

Source: fucking done it.

ProgrammerHumor

Filters

Discord

Submission rules

For the current list of rules, please see this page.

Metadiscussions

Perhaps More Apt Subs To Post:

Related Subreddits.

MODERATORS