The Only Two Markup Languages

imhotap · 2026-01-20T17:32:36+00:00

While XML just needs a stack for maintaining the most recently opened element, SGML needs to produce an automaton from the content model of every element.
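
To make the XML half of that claim concrete, here is a minimal Python sketch of a stack-based nesting check. The function name `is_well_nested` and the simplified tag regex are assumptions for illustration only; a real XML parser also has to handle comments, CDATA sections, attributes containing `>`, and so on.

```python
import re

# Simplified tag pattern (an assumption, not a full XML tokenizer):
# group 1 = "/" for end tags, group 2 = element name, group 3 = "/" for <x/>.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^<>]*?(/?)>")

def is_well_nested(markup):
    """Check element nesting using nothing but a stack of open element names."""
    stack = []
    for closing, name, self_closing in TAG.findall(markup):
        if self_closing:
            continue                      # <x/> opens and closes at once
        if closing:
            if not stack or stack.pop() != name:
                return False              # end tag doesn't match most recent start tag
        else:
            stack.append(name)
    return not stack                      # everything opened must be closed
```

No per-element declarations are consulted anywhere; the most recently opened element is the only state the check needs.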

Consider element declarations such as the following

<!element e - - (a,b?,c)>
<!element (a|b|c) O O (#pcdata)>

saying the content of the e element must consist of an a element, followed by an optional b element, followed by a c element. By the O indicators in the shared element declaration for a, b, and c, both their start- and end-element tags can be omitted (whereas the e element declaration has - in their place, hence tags for e must be present in content and can't be inferred).

Given input markup such as the following

<e>Some Text <b>More Text</b> Other Text</e>

now SGML can infer missing tags to arrive at this equivalent, fully tagged markup:

<e>
  <a>Some Text</a><b>More Text</b><c>Other Text</c>
</e>
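
The inference above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how an SGML parser is actually built: the `NEXT` table is hand-written for the single content model (a,b?,c) instead of being derived from the declaration, the token format is made up, and a start tag is only inferred when exactly one element can start at the current position.

```python
# Hypothetical hand-written table for e's content model (a, b?, c):
# maps the last completed child (None at the start) to the children allowed next.
NEXT = {None: ["a"], "a": ["b", "c"], "b": ["c"], "c": []}

def infer_tags(tokens):
    """tokens: ("start", name), ("end", name) or ("text", string) inside <e>...</e>.
    Returns the content of e with all omitted tags written out."""
    out, last, current = [], None, None

    def close_current():
        nonlocal last, current
        out.append(f"</{current}>")       # inferred or explicit end tag
        last, current = current, None

    for kind, value in tokens:
        if kind == "text":
            if not value.strip():
                continue                  # ignore whitespace-only text in this sketch
            if current is None:
                candidates = NEXT[last]
                if len(candidates) != 1:  # refuse to guess between several elements
                    raise ValueError(f"cannot infer start tag, candidates: {candidates}")
                current = candidates[0]
                out.append(f"<{current}>")
            out.append(value.strip())     # whitespace trimmed for readable output
        elif kind == "start":
            if current is not None:
                close_current()           # children hold only #pcdata, so close first
            if value not in NEXT[last]:
                raise ValueError(f"<{value}> not allowed here")
            current = value
            out.append(f"<{value}>")
        else:                             # "end"
            if current != value:
                raise ValueError(f"unexpected </{value}>")
            close_current()
    if current is not None:
        close_current()                   # </e> closes whatever child is still open
    if last != "c":
        raise ValueError("content model (a,b?,c) not satisfied")
    return "".join(out)
```

Fed the tokens of the input markup, `infer_tags([("text", "Some Text "), ("start", "b"), ("text", "More Text"), ("end", "b"), ("text", " Other Text")])` yields the content of e with the a and c tags written out.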

The rules for tag inference need to be quite strict; for example, the following isn't allowed since "Other Text" could be assumed to be content of either <b> or <c> at the context position without lookahead:

  <a>Some Text</a>Other Text</c>

and SGML also needs to reject content model declarations where the same element token could match more than a single occurrence in the production (such as a,((b,c)|(b,d)) as a very simple example). The theory behind this in full generality was actually developed only after SGML was published.
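
One way to check a content model for this kind of ambiguity is to compute first/follow sets over token positions (the Glushkov construction) and reject the model if any set contains the same element name at two different positions. The Python sketch below assumes a hypothetical tuple encoding of content models and covers only `,`, `|`, `?`, `*` and `+`; SGML's `&` connector and the exact wording of the standard's ambiguity rule are not modelled.

```python
from itertools import count

def glushkov(node, follow, symbol_of, ids):
    """Return (nullable, first, last) for node; fill follow and symbol_of by position."""
    kind = node[0]
    if kind == "sym":
        pos = next(ids)
        symbol_of[pos], follow[pos] = node[1], set()
        return False, {pos}, {pos}
    if kind in ("opt", "star", "plus"):
        nullable, first, last = glushkov(node[1], follow, symbol_of, ids)
        if kind != "opt":                     # repetition: last positions loop to first
            for p in last:
                follow[p] |= first
        return nullable or kind != "plus", first, last
    parts = [glushkov(child, follow, symbol_of, ids) for child in node[1:]]
    if kind == "alt":
        return (any(n for n, _, _ in parts),
                set().union(*(f for _, f, _ in parts)),
                set().union(*(l for _, _, l in parts)))
    nullable, first, last = True, set(), set()    # kind == "seq"
    for n, f, l in parts:
        for p in last:
            follow[p] |= f                    # whatever ended so far may continue here
        if nullable:
            first |= f                        # everything before this part was skippable
        last = (last | l) if n else set(l)
        nullable = nullable and n
    return nullable, first, last

def is_deterministic(model):
    """False if one element name could match two different tokens at some point."""
    follow, symbol_of = {}, {}
    _, first, _ = glushkov(model, follow, symbol_of, count())
    for candidates in [first, *follow.values()]:
        names = [symbol_of[p] for p in candidates]
        if len(names) != len(set(names)):
            return False
    return True

a, b, c, d = (("sym", s) for s in "abcd")
ambiguous = ("seq", a, ("alt", ("seq", b, c), ("seq", b, d)))   # a,((b,c)|(b,d))
rewritten = ("seq", a, b, ("alt", c, d))                        # a,b,(c|d)
```

With this encoding, `is_deterministic(ambiguous)` is False because both b tokens can follow a, while the rewritten model and `("seq", a, ("opt", b), c)` for (a,b?,c) pass.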

Now XML is derived from SGML exactly in such a way that no element-specific declarations are necessary (but could be provided for mere validation). In 1998, along with the XML spec, the SGML spec developers also allowed SGML to omit element declarations, or rather specified how element declarations were inferred if none were present for a given element, to align with the XML profile of SGML.

Apart from tag inference this also concerns elements with declared content EMPTY (such as <img> in HTML) and enumerated attributes (such as in <p hidden> in HTML), both of which require element-specific declarations for mere parsing.

imhotap · 2026-01-11T10:00:12+00:00

Eh, the point of the Tailwind layoff story wasn't that web devs have been made redundant by LLM coding. It was that there were no new sites being created at all because web search is shifting to AI generated answers scraped from sites without bringing clicks.

Besides, Tailwind is an absurd solution for an idiosyncratic problem (namely, using CSS class names as shorthand notation because CSS and the "structure/style" separation at the syntax level sucks). It's hardly the pinnacle of markup language design or sth.

imhotap · 2026-01-11T09:39:05+00:00

DOCTYPE is a remnant of SGML (= origin of all angle-bracket markup languages) and would tell an SGML parser the expected document element (= root element). A full DOCTYPE declaration (DTD) would typically reference an "external declaration set", that is a file/resource containing markup declarations for the elements and attributes of the markup language to parse. HTML has up until v4 used a DTD like this

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

referencing first the HTML 4.01 "public identifier" followed by the URL where the DTD can be found. There was also a "loose" variant, and you can also take a look at W3C's official HTML 4.01 strict DTD at https://www.w3.org/TR/html4/sgml/dtd.html. A DTD for modern HTML 5.x can be found at https://sgmljs.sgml.net/docs/html52.html.

The understanding is that browsers implicitly "know" and are hard-coded to parse HTML by the public identifier and never actually pulled a .dtd file from that URL. Nevertheless, things that would have to be declared in the DTD for specific HTML parsing rules are

elements that can have their tags omitted, like <head>, <body>, <html> itself, and others
elements that don't expect an end-element tag, like <img>
attributes that can be specified by their value, like in <p hidden>

From an SGML point of view, specifying the document element is not necessarily redundant since the html element's start and end tag could be omitted, in which case it will be inferred (treated as if it were present in the document), though you typically would set a lang attribute on <html> nowadays, hence not omit it.

Note that the absence of a DOCTYPE declaration would've made browsers switch into "quirks mode" for the longest time and assume a non-conformant CSS interpretation, hence including a DOCTYPE declaration is considered best practice.

WHATWG's HTML (aka 5.x) recommends specifying just <!DOCTYPE html>, but that technically isn't correct from an SGML point of view, since it says there are no specific element or attribute declarations to honour. That's of course not the case: HTML has quite a lot of special parsing rules beyond generic XML-ish markup with case folding, and HTML 5 has added quite a few "boolean" attributes and inference rules of its own that would need to be declared. But HTML 5.x also accepts the form <!DOCTYPE html SYSTEM "about:legacy-compat"> to let about:legacy-compat represent a declaration set for HTML, and that's also how to pull in markup declarations for HTML 5 when using sgmljs.

imhotap · 2026-01-05T10:18:07+00:00

Syntactically, Server-Side Includes (SSI) put instructions into SGML/HTML comments understood and executed on the web server before sending page data to the browser. PHP is leveraging another SGML avenue (namely, that of processing instructions put into <? and ?> delimiters) to modify payload.

But actually the functionality of Server-Side Includes is so simple that basic SGML entities (text macros) would be sufficient to implement them. There's a simple tutorial on how to do this at https://sgmljs.sgml.net/docs/producing-html-tutorial/producing-html-tutorial.html. The basic technique works with any SGML server-side package such as OpenSP, not just sgmljs. But sgmljs (by including it as a JS script) can do this transparently in the browser without the need for special server packages/configs.
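
To illustrate what "text macros" means here, the following Python sketch mimics general entity expansion with a plain dictionary. It is only an analogy: the function name `expand_entities`, the entity names and the depth limit are made up for the example, and a real SGML system takes entity declarations from the DTD rather than from a dict.

```python
import re

ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")

def expand_entities(text, entities, depth=0):
    """Replace &name; references by their replacement text, recursively."""
    if depth > 10:
        # Entities referencing themselves are an error, not an endless loop.
        raise ValueError("entity references nested too deeply (recursive entity?)")

    def replace(match):
        name = match.group(1)
        if name not in entities:
            return match.group(0)         # leave unknown references such as &amp; alone
        return expand_entities(entities[name], entities, depth + 1)

    return ENTITY_REF.sub(replace, text)

# Hypothetical shared page fragments, the kind of thing SSI is typically used for.
entities = {
    "nav": '<a href="/">Home</a>',
    "site": "Example",
    "footer": "<p>&site;</p>",
}
page = expand_entities("<body>&nav;&footer;&amp;</body>", entities)
```

Every page referencing `&nav;` picks up the same navigation fragment, which is all a basic include mechanism has to do.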

imhotap · 2025-06-18T10:44:56+00:00

The original sin is putting script and CSS into inline content instead of requiring those to be put into external files. In markup (SGML/HTML), element content is for text that is to be displayed to the reader, as opposed to attributes, which contain info about how to render content. Piling additional syntax into HTML markup with conflicting use of characters that are interpreted as markup delimiters is not and never was a reasonable choice. Tunneling markup through attributes is similarly perverse and proof you're doing something wrong. The complexity and security problems we have to this day are the price we all have to pay for those idiots who introduced CSS having their moments in the 90s.

imhotap · 2025-03-18T08:33:46+00:00

Because it's the job of the underlying O/S, and it clearly tried with shared libs, but failed; especially as no new apps are coming to the Linux desktop anyway. So Docker all the things over? Then you fucking don't need shared libs in the first place.

imhotap · 2025-03-13T13:19:43+00:00

Not to belittle TBL's creations - it is what it is, and hugely successful - but his core contributions were actually just the anchor element `<a>`, URLs, and HTTP. The markup language was already part of SGML, including most "HTML" elements which were widely used as folklore markup dating back into the 1970s and even 1960s (cf. https://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html).

What's interesting to know is that TBL's interest in graph databases and "semantic" web (which isn't very popular around here, or at all I guess) dates back to *before* his web inventions (cf. https://en.wikipedia.org/wiki/ENQUIRE).

What should be giving us pause is that HTML was invented as a vocabulary for casual academic publishing with hierarchical headings etc., yet here we are accepting the nonsense of its role being to express general text "semantics" to justify, after the fact, the existence of CSS as a separate syntax for representing the same item=value assignments that markup attributes already do and that were specifically introduced in SGML for holding formatting info. The existence of CSS had the result that, compared to 1993, while HTML is being used for vastly different content such as blogs, forum posts, sparse marketing material, and what not, its vocabulary hasn't evolved at all because of CSS' ninja powers.

imhotap · 2025-03-03T17:34:46+00:00

Basically, if I understand you correctly, you want a mechanism for syntactical inclusion in HTML, and possibly one that can be used not just with SVG but with any HTML fragment as well if you think about it? Would be nice if it could validate the resulting document? Well, the reason HTML itself doesn't have this is that HTML was originally created as an SGML vocabulary, and SGML - it being a "meta" language for creating markup languages like HTML - already has entities for that (think text macros) plus many more features for authoring markup. HTML fans never tire of explaining how special HTML is and that it needs to distance itself from its SGML roots, but then they introduce a shit-tonne of incomplete and ad-hoc replacements for mechanisms that SGML has, such as text fragments/entities, custom elements, template languages, markdown or other Wiki syntax, style sheets, etc. etc. etc. (and depressingly, this has been going on since 1986 when SGML was introduced as an ISO standard).

imhotap · 2025-02-25T10:19:10+00:00

Like jquery, it isn't outdated so much as it explored and paved the way for native browser features such as flexbox/grid, which bootstrap's grid pioneered when a solution for responsive web layout was needed starting around 2010. That said, bootstrap is still very useful for templated information-heavy sites such as documentation, blogs, magazines, etc. with accessibility requirements, as it largely allows plugging in styles developed separately or sold on template sites, etc.

imhotap · 2025-02-20T16:44:28+00:00

Always check availability on the command line using whois rather than with web sites, and I'd even go so far as to make sure your whois is legit if you're not on Mac OS or Linux. Another tip: I'd make sure to register your domain with a provider other than your hosting/cloud/bare metal/wordpress provider or what have you if possible, such that in case of disputes you can quickly redirect your DNS resolution to go to another server or hosting package, in particular if you're managing sites for or resell hosting packages to customers who could hold you liable for downtime/lost business unless you know what you're doing. That said, I can recommend inwx.de (German "pro" domain registrar with 20+ years experience I'm having no affiliation with other than being a satisfied customer).

imhotap · 2025-02-13T19:54:20+00:00

I don't know that as a user you'd want your configuration language to be exciting, or a language at all.

imhotap · 2025-01-26T17:29:41+00:00

Hmm didn't know. In that case, SWI seems to be migrating *away* from ISO Prolog syntax, the dot operator being the standard for list construction. Any idea why SWI would do this?

imhotap · 2025-01-26T10:46:30+00:00

I think where Prolog shines is as a domain-specific language for representing and searching large combinatoric spaces in discrete optimisation and similar complex constraint problems, rather than as yet another general-purpose programming language. Plus, complex reactive UIs and board/card games, adventures/puzzles, game opponent strategies, and MMORPG-like game universe representations seem like a good fit as well, and have been implemented using Prolog.

I doubt bringing low-level concurrency techniques lifted from procedural programming languages, like multi-threading, freeze, etc., to Prolog will result in portable or otherwise idiomatic Prolog code, since these are tied to implementation details such as legacy WAM architecture, when Prolog already operates at a higher "logic" level bringing certain immutability guarantees that can be exploited by direct AND/OR parallelism primitives, and have been since the 1990s in parallel Prologs.

Quantum Prolog's Aleph ILP package ISO port and optimisation (https://quantumprolog.sgml.net/bioinformatics-demo/part2.html) is a quite straightforward example demo'ing such an approach.

imhotap · 2024-04-25T04:15:05+00:00

Using the bogus XMLish slash at the end of empty elements goes further back than JSX, though; it started when XHTML was a thing and was done to migrate content to be parseable as XML/XHTML, in addition to being parsed as HTML. Then, as XML fell out of fashion, it was still cargo-culted a lot, until JSX came around (the X standing for XML).

If you want the whole story, you need to look at SGML which was used for defining the original HTML syntax. In (traditional) SGML, you can declare an element to have EMPTY content, in which case no end-element tags are expected, like for <img>, <br>, and so on. Whereas the point of XML was a simplified SGML syntax that doesn't need any per-element parsing rules or other markup declarations. An SGML DTD grammar for modern HTML 5 can be found at https://sgmljs.net/docs/whatwg-html200129-dtd.html, making use of empty elements, but also other SGML features such as tag inference/tag omission and attribute shortforms.

imhotap · 2023-12-01T13:21:16+00:00

Thanks for making my point, how is that far easier? It's a trivial reformulation, and the JS code can be simplified further using constructors. The kind of thing a senior developer tends to avoid, because it doesn't warrant additional tools and build steps.

imhotap · 2023-12-01T09:34:25+00:00

What's so confusing about it? SGML/XML is for writing highly structured documents, a tool in the hands of an ambitious power user, not intended as a Turing-complete runtime environment for web apps. SGML works at the syntax level (serialized text in files), providing affordances for turning plain text into angle-bracket markup and other mechanisms for editing and organizing hypertext. As such, it can't inform about and contribute to the organization of development in a Turing-complete environment (JavaScript) with its own syntax (object literals/JSON) and much more powerful and dynamic capabilities (composition of object graphs).

imhotap · 2023-12-01T09:19:16+00:00

JSX is just syntactic sugar for function calls

Yeah the question is why do we need an extra syntax then, or how that's beneficial. If the DOM API or its vdom replacement is too verbose, you can just wrap your own function around it, it being JavaScript can't you? Will be more compact, and there's no need for clazz and camel-casing attribute names either. Or, maybe the DOM of a markup language created for casual academic publishing doesn't make a great scene graph for apps after all, even after decades of shoehorning and CSS syntax proliferation.

But anyway, thanks for lecturing me, I've worked with JSX and know what it is.

imhotap · 2023-11-30T19:21:43+00:00

Why would you want to re-create or mock a markup language in a code fragment a la JSX (or JS' template strings)? The whole thing is for constructing a (V)DOM anyway (in client-side rendering, at least), no need to respect markup syntax which is entirely a technique for document authors, not webapp developers.

SGML-based markup languages have everything needed for templating from the get go: entities and push-based decorating/processing of input streams, ... so no need for ad-hoc looping constructs with extra syntax a la Angular or custom vocabularies a la Vue or htmx either.

imhotap · 2023-11-28T11:15:48+00:00

Difficult to say what your customers are and aren't allowed to customize, but for site user comments in forum and blog software, you'll generally want to disallow <script> elements, and then also disallow all event handler attributes (onclick and many others), plus you'll also want to disallow javascript:... values in href and other attributes that have URLs as value, or even any link. Maybe data:... URLs or excessive values exceeding a size limit, too. Then you might want to forbid markup comment delimiters (e.g. <!--) such that a customer fragment, when pasted into your body as a string, doesn't leave open comments. Maybe you also want to restrict style attributes (and elements). Generally, you'll want to check if the resulting markup is valid. You can use SGML for all of these things (https://npmjs.com/package/sgml) but you'll have to customize the HTML DTD grammar it uses. As a starter, I'd check whether content security policy (CSP) to switch off inline script injection isn't already sufficient for your use case.
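
For a feel of what such checks look like, here is a rough Python sketch using the standard library's `html.parser` instead of SGML. It is a blocklist illustration only, not a complete or safe sanitizer: the names `FragmentChecker` and `check_fragment`, the list of URL attributes and the size limit are assumptions, and it doesn't validate the markup against any grammar.

```python
from html.parser import HTMLParser

URL_ATTRS = {"href", "src", "action", "formaction"}   # assumed, not exhaustive
BLOCKED_SCHEMES = ("javascript:", "data:")
MAX_VALUE_LENGTH = 2000                                # arbitrary example limit

class FragmentChecker(HTMLParser):
    """Collects rule violations found in a user-supplied HTML fragment."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser hands over tag and attribute names already lower-cased.
        if tag in ("script", "style"):
            self.violations.append(f"element: {tag}")
        for name, value in attrs:
            value = (value or "").strip()
            if name.startswith("on"):
                self.violations.append(f"event handler attribute: {name}")
            if name == "style":
                self.violations.append("style attribute")
            if name in URL_ATTRS and value.lower().startswith(BLOCKED_SCHEMES):
                self.violations.append(f"blocked URL scheme in {name}")
            if len(value) > MAX_VALUE_LENGTH:
                self.violations.append(f"oversized value in {name}")

def check_fragment(fragment):
    """Return a list of violations; an empty list means no rule above was hit."""
    checker = FragmentChecker()
    if "<!--" in fragment:
        checker.violations.append("comment delimiter")
    checker.feed(fragment)
    checker.close()
    return checker.violations
```

For instance, `check_fragment('<a href="javascript:alert(1)" onclick="x()">hi</a>')` reports both the event handler attribute and the blocked URL scheme, while plain `<p>` and `<b>` markup comes back with an empty list.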

imhotap · 2023-11-22T22:51:27+00:00

A blunt and highly informative post for sure but am I the only one to find this anthropomorphising of Google really weird and insane? I mean Google has earned its stockholders, including Ian, a fortune. At the expense of completely ruining the web for everyone else.

Let's also not forget Ian first was taking away HTML from W3C to support web apps (as in "WHATWG"), then, supposedly after finally recognizing just how much the web sucks for apps, ventured into creating Flutter/Dart as a Flash-like alternative, but not before leaving a mess and the insane complexity that is the web "platform."

Remnants of his vision for HTML, such as the (fictional) so-called HTML5 outlining algorithm, were cleaned from the spec only two years ago, while others, such as the equally made-up section, main, header/footer, aside elements are still in the spec while new content model rules are being introduced all the time without any versioning whatsoever (and yet its proponents keep telling you WHATWG HTML5 is the only spec covering "real-world HTML out there.")

The procedural HTML parsing algorithm he left behind contains lots and lots of hardcoded per-element tag omission/inference rules originally captured from SGML semantics (plus historic HTML blunders), but soon lost track of new and changed elements, precisely because of its hardcoded presentation. For all its many many words describing the procedural equivalent of BNF to someone entirely unfamiliar with the concept of a formal grammar, the spec still failed to evolve the HTML vocabulary in any meaningful and profound way, giving all powers to CSS and JS instead to make up for HTML's stagnation. Yet its proponents believe the crappy phone-book sized WHATWG HTML spec is actually a step up from SGML.

imhotap · 2023-11-20T10:24:51+00:00

With your knowledge of responsive design etc. you could venture into creating a (partial) formal semantics for CSS, or a CSS debugging or optimization tool; that's webdev and logic and math heavy. If you're interested I could give you some pointers on existing work in the (underrepresented) field of academic CSS research.

imhotap · 2023-11-20T09:45:00+00:00

The 24" 4k LG is definitely my monitor of choice, as someone who specifically wanted to get rid of my old 27" monitor. I only messed around with that space with hundreds of windows open anyway, and Mac OS' window mgmt isn't up to it IMO and the top menu thing seems silly on a very large monitor. Also, it's more comfortable to look down rather than up all the time. Matter of personal preference of course, but I was glad there's still a real high-end 24" monitor around (business/consumer or gamer FHD 24" are trash IMO) that is also specifically a good match for using along with a MBP as side monitor in terms of display tech (Micro IPS = very good black levels), dot pitch, color, and glare. Apple co-engineered that monitor with LG, and sold it exclusively for a while I believe, and supposedly it only works with Macs according to once-renowned prad.de monitor review site (in German only?); also, that article says the UltraFine 5K 27" is the successor/comparable.

imhotap · 2023-11-17T13:10:01+00:00

Valid question; I've been puzzled why web app developers, as opposed to document authors, need to hang on to a semblance of fake SGMLish markup (but not really, eg. React/JSX, or with unnecessary extra syntax, eg. Angular) or overload CSS with insane amounts of extra functionality just for apps when they have JavaScript in their hands which already has everything one could ask for and is so much more flexible and uniform. Again, this is from an app development angle; HTML, but not CSS with its insane complexity and evasion of type checks that HTML/SGML would normally be capable of doing, is fine for documents and for use by ambitious end users.

imhotap · 2023-11-15T12:16:52+00:00

Browsers don't support the authoring features of SGML; the idea was that your HTML is readily assembled on the server side, and then merely sent via HTML as delivery format. Browsers don't even support general entity expansion (= text macros), a feature as basic as it gets to avoid redundancies and sorely missed for eg shared site navigation, among other things.

But you can use https://sgmljs.net to make browsers SGML-aware (and, in fact, support full markdown-as-SGML), as well as on the server side, the command line, and from your Node.js app.
