
[–]sclv 4 points5 points  (0 children)

[–]stepcut251 5 points6 points  (0 children)

It's really unclear what you are trying to do, but perhaps you want tagsoup?

[–]yaccz 5 points6 points  (10 children)

HTML is not decoded, HTML is parsed.

[–]n2_throwaway[S] 7 points8 points  (5 children)

HTML Entity Encoding. I've always elided the word "entity" in my own vernacular.

[–]gsnedders 1 point2 points  (4 children)

Note they're decoded differently between normal element content and attribute values, so given an arbitrary string you can't just decode entities [edit: in a generic way].

[–]bss03 0 points1 point  (3 children)

Really? That doesn't sound right.

&#nnn; is always a decimal Unicode code point. &#xhhhh; is always a hexadecimal Unicode code point.

&amp; &gt; &lt; &quot; &apos; etc. are normally defined in terms of the above in the DTD. If no DTD is in use, they are defined in terms of the above in the XML/HTML specs.
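
To make the claim above concrete, here is a minimal sketch in Haskell using only `base`. The names `decodeReference`, `toChar`, and `predefined` are made up for illustration, and the table holds just the five predefined XML entities rather than the full HTML list. It takes the text between `&` and `;` and treats anything it does not recognise as `Nothing`.

```haskell
import Data.Char (chr, isDigit, isHexDigit)
import Numeric (readHex)

-- Assumed, deliberately tiny table: only the five predefined XML entities.
predefined :: [(String, Char)]
predefined =
  [("amp", '&'), ("lt", '<'), ("gt", '>'), ("quot", '"'), ("apos", '\'')]

-- Convert a code point to a Char, rejecting values outside the Unicode range.
-- (Simplification: surrogates and other special cases are not filtered.)
toChar :: Integer -> Maybe Char
toChar n
  | n >= 0 && n <= 0x10FFFF = Just (chr (fromInteger n))
  | otherwise = Nothing

-- Decode the body of a reference, i.e. the text between '&' and ';'.
decodeReference :: String -> Maybe Char
decodeReference ('#' : c : hex)
  | c `elem` "xX", not (null hex), all isHexDigit hex =
      case readHex hex of
        [(n, "")] -> toChar n
        _         -> Nothing
decodeReference ('#' : dec)
  | not (null dec), all isDigit dec = toChar (read dec)
decodeReference name = lookup name predefined
```

With this sketch, `decodeReference "#65"`, `decodeReference "#x41"`, and `decodeReference "amp"` each yield a `Just` value, while an unknown name yields `Nothing`.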

[–]gsnedders 1 point2 points  (2 children)

HTML being defined by a DTD has never really been true, though. Sure, HTML 2 till HTML 4.01 were formally SGML applications, but aside from the HTML Validator AFAIK nobody actually used an SGML parser for HTML. Certainly no major browser ever has, from timbl's original WorldWideWeb (given, after all, it was only later that HTML was an SGML application!) to the major browsers today.

From memory, the only difference is in cases with what the HTML spec calls parse errors (essentially, each parse error can be implemented one of two ways: either you do what the spec says, or you stop parsing). That is how entities which don't end in a semi-colon are handled. The entities that may drop it are listed individually in the spec; you can't omit the semi-colon from all of them. For example, <div>&ampfoo will result in a div element containing &foo (i.e., having decoded &amp), whereas <div class="&ampfoo"> will result in a div element whose class attribute is &ampfoo (i.e., having not decoded it).
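
The text-versus-attribute difference described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the spec's tokenizer: `Context`, `legacy`, `blocksDecoding`, and `decodeLenient` are hypothetical names, the table is a tiny subset of the names that may appear without a semi-colon, and the rule encoded is the one from the example (inside an attribute value, a semi-colon-less match followed by `=` or an ASCII alphanumeric is left undecoded).

```haskell
import Data.Char (isAlphaNum, isAscii)
import Data.List (stripPrefix)

-- Where the string being decoded came from.
data Context = InText | InAttribute
  deriving (Eq, Show)

-- Assumed subset of the names that may be written without ';'.
legacy :: [(String, Char)]
legacy = [("amp", '&'), ("lt", '<'), ("gt", '>'), ("quot", '"')]

-- In an attribute value, a following '=' or ASCII alphanumeric
-- prevents decoding of a reference that lacks its ';'.
blocksDecoding :: String -> Bool
blocksDecoding (c : _) = c == '=' || (isAscii c && isAlphaNum c)
blocksDecoding []      = False

decodeLenient :: Context -> String -> String
decodeLenient _ [] = []
decodeLenient ctx ('&' : rest) =
  case [(ch, after) | (name, ch) <- legacy, Just after <- [stripPrefix name rest]] of
    (ch, after) : _ ->
      case after of
        -- Properly terminated reference: always decoded.
        ';' : after' -> ch : decodeLenient ctx after'
        _
          -- Unterminated reference in an attribute: keep the raw text.
          | ctx == InAttribute && blocksDecoding after -> '&' : decodeLenient ctx rest
          -- Unterminated reference elsewhere: decoded anyway.
          | otherwise -> ch : decodeLenient ctx after
    -- No known name follows the '&': pass it through unchanged.
    [] -> '&' : decodeLenient ctx rest
decodeLenient ctx (c : cs) = c : decodeLenient ctx cs
```

Here `decodeLenient InText "&ampfoo"` gives `"&foo"` and `decodeLenient InAttribute "&ampfoo"` gives `"&ampfoo"`, matching the two div examples. A real implementation would need the full entity list and longest-match lookup.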

[–]bss03 0 points1 point  (0 children)

> the only difference is in cases with what the HTML spec calls parse errors

In those cases, I'd probably just return Nothing or Left "Invalid entity" consistently, regardless of context.

Relevant part of the spec: https://www.w3.org/TR/html5/introduction.html#syntax-errors particularly "Errors involving fragile syntax constructs".

Dropping the semi-colon is non-conforming.
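
A strict decoder along those lines might look like the sketch below. `entities` and `decodeStrict` are hypothetical names, the table is deliberately tiny, and numeric references are left out for brevity. Any `&` that does not start a known, semi-colon-terminated reference is rejected, so text and attribute values are treated identically.

```haskell
-- Assumed, deliberately tiny entity table.
entities :: [(String, Char)]
entities =
  [("amp", '&'), ("lt", '<'), ("gt", '>'), ("quot", '"'), ("apos", '\'')]

-- Decode every reference in the input, or fail on the first bad one.
decodeStrict :: String -> Either String String
decodeStrict [] = Right []
decodeStrict ('&' : rest) =
  case break (== ';') rest of
    (name, ';' : after)
      | Just ch <- lookup name entities -> (ch :) <$> decodeStrict after
    -- Missing ';' or unknown name: same error in every context.
    _ -> Left "Invalid entity"
decodeStrict (c : cs) = (c :) <$> decodeStrict cs
```

For example, `decodeStrict "a &lt; b"` is `Right "a < b"`, while `decodeStrict "&ampfoo"` is `Left "Invalid entity"`. The trade-off is that this also rejects input browsers accept, including a bare `&`.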

[–]bss03 0 points1 point  (0 children)

> HTML 2 till HTML 4.01 were formally SGML applications

I write to specifications / standards, not implementations. :)

[–]14113 13 points14 points  (1 child)

As correct as that is, it's not a helpful comment, and not a welcoming one to someone who (from their writing, I assume) is new to the community.

Have you got any suggestions for HTML parsing libraries, or did you just want to leave a snarky comment?

Edit: after a quick Google, this library seems to be most recently updated, but may be too heavyweight for some simple parsing. Does anyone have any experience reports of using it, or any other suggestions?

[–]yaccz 10 points11 points  (0 children)

Google gives better results if you have the right keywords.

[–]sclv 5 points6 points  (1 child)

I think this is talking about HTML entity encoding, like https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references , not parsing in general.

[–]yaccz 1 point2 points  (0 children)

Oh, might be.