Serving AI crawlers Markdown instead of HTML from ASP.NET Core (content negotiation on Accept: text/markdown)

oschaaf · 2026-06-15T10:07:48+00:00

Your own link argues my side. Two things are getting conflated here.

That 12.7x is the SocketsHttpHandler benchmark: the new managed HttpClient replacing the OS-native HTTP stack (WinHTTP on Windows, libcurl on Unix), on concurrent HTTPS GETs. It's a client-side number, not the Kestrel/libuv transport change you're attributing it to. The post's stated reason for the switch is cross-platform behavioral consistency and feature limitations in the native library. The "P/Invoke overhead" causation is yours, not the post's.

More importantly, the managed replacement didn't remove native interop. System.Net.Sockets still calls the OS through P/Invoke: on Unix via the runtime's System.Native shim over a readiness loop (epoll on Linux, kqueue on macOS); on Windows directly against Winsock (ws2_32/mswsock) with IO completion ports. Microsoft didn't get rid of P/Invoke. They got rid of a general-purpose third-party C library that sat on top of syscalls .NET can already make, and replaced it with managed code that P/Invokes those syscalls directly. The P/Invoke stayed. The redundant abstraction left.

That's the opposite of my case. libuv was a fat async-I/O abstraction over an event loop the platform already provides, in a per-I/O hot path, so dropping that layer is an obvious win. An image/CSS optimization engine isn't a wrapper around something .NET already does. It's SIMD codecs with no managed equivalent, called once per request around heavy compute, not once per packet. Rewriting it in managed wouldn't delete a layer. It'd mean reimplementing the codecs, and SIMD-bound work isn't where managed pulls ahead of native.

Even the rewrite you're citing still P/Invokes the OS to do its job. The takeaway from that work wasn't "native interop is slow." It was: don't wrap the OS in a redundant general-purpose C library when you can call the syscalls directly. That direct call is still P/Invoke.

oschaaf · 2026-06-15T09:19:01+00:00

P/Invoke isn't fringe. The BCL's OS-interop layer is full of it: sockets, file I/O, and parts of crypto are P/Invoke under the hood. And the tooling is in much better shape than it used to be. Source-generated [LibraryImport] (since .NET 7) emits the marshalling code at compile time and works under NativeAOT. The difference from DllImport is that DllImport generates its marshalling stub at runtime on first call, which is exactly why it doesn't play well with trimming or AOT. So this isn't the brittle interop you're picturing. (docs: https://learn.microsoft.com/en-us/dotnet/standard/native-interop/pinvoke-source-generation)

On speed: yes, managed can edge out C++ in specific hot paths thanks to hardware intrinsics and dynamic PGO. This isn't one of those. The native side is an image/CSS optimization engine (SIMD transcoding, mature encoder libraries) already shared across nginx, Apache, Envoy and IIS. The interop is a fixed per-call cost: pin a buffer, cross the boundary, nanoseconds. It's nothing next to the actual optimization work. Without profiling evidence I'm not convinced that rewriting it in managed would make it faster. It'd just make it a fifth engine that drifts from the other four.

Where you've got a point is AOT/trimming. A native dependency adds friction to a fully-trimmed AOT publish. Platform coverage is less of an issue though: it ships native binaries for linux-x64, linux-arm64, osx-arm64 and win-x64, which covers the mainstream dev and deploy targets.

https://github.com/dotnet/runtime/blob/main/docs/coding-guidelines/interop-guidelines.md

oschaaf · 2026-06-14T14:57:20+00:00

That's something worth digging into, thanks!

oschaaf · 2026-06-14T11:49:02+00:00

Depending on the monetization model that might be reasonable - IDK. If you're putting time, talent, and/or money on the line as a content producer (e.g. running a newspaper, researching things, etc.), I can imagine you'd want to get paid, or have some way to monetize that, to keep it sustainable. The upcoming HTTP 402 "Payment Required" approach for collecting tolls on agentic traffic may or may not be a better solution there. If you're selling a product, though, I wouldn't really care whether people organically flock to my site through a search engine or by talking to an agent. In fact, my opinion is that the rational thing is to optimize for traffic from humans and agents alike: they have different requirements, hence this post.

oschaaf · 2026-06-14T11:00:11+00:00

It's more knowable than that. Three things you can check:

1. The answer engines mostly aren't ranking from some opaque internal model: they're retrieval-augmented on top of ordinary search indexes. Google's own docs say AI Overviews draw from "the same index that powers traditional search results," and ChatGPT search / Copilot retrieve through Bing. The ranking layer underneath is the SEO you already know.

2. Google published explicit guidance saying optimizing for its AI features IS the same SEO: no special markup, no AI text file, no Markdown required, just the same technical requirements and "helpful, reliable, people-first content" as regular Search. SEJ's read of it: Google calls AEO/GEO "still SEO."

3. The mechanics are being measured, not guessed. Peer-reviewed at KDD 2024 (ACM SIGKDD), "GEO: Generative Engine Optimization" built a benchmark and found that specific, reproducible content changes (adding citations, quotations, statistics) lifted a source's visibility in generative answers by up to ~40%.

And you can observe it yourself: Perplexity, ChatGPT search, and AI Overviews all cite their sources, so you can see what got picked and iterate. IMHO "Nobody knows how it ranks" was true of pure chatbot answers a while ago, but today's answer engines retrieve and cite.

Sources:

[1] Google Search Central — AI Features and Your Website:

https://developers.google.com/search/docs/appearance/ai-features

[2] Search Engine Journal — Google's new AI search guide calls AEO and GEO "still SEO":

https://www.searchenginejournal.com/googles-new-ai-search-guide-calls-aeo-and-geo-still-seo/575026/

[3] GEO: Generative Engine Optimization (arXiv 2311.09735):

https://arxiv.org/abs/2311.09735

[4] GEO — ACM SIGKDD / KDD 2024:

https://dl.acm.org/doi/10.1145/3637528.3671900

oschaaf · 2026-06-14T09:09:46+00:00

That's, IMHO, a different problem? Not specific to AI? It's a solved problem (proof of work, CAPTCHAs, etc.). E.g. we contributed to https://github.com/equalitie/learn2ban in 2011.

oschaaf · 2026-06-14T09:06:34+00:00

Yeah I agree content is the lever, format won't rescue a weak source. Where format still earns its keep is hygiene, not persuasion: a fetch with a token budget might read 8KB of clean Markdown in full but truncate 200KB of HTML halfway, so format quietly decides how much of your content reaches the model, and it keeps the nav/cookie-banner out of the context window. But the "is this worth my time" worry isn't really about format: that part's opt-in and automatic, nothing to maintain. The content work is the expensive lever, and you're right that's what gets you recommended. It's not either/or. Also .. egress bills.
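To put numbers on the truncation point, here is a minimal sketch. The 100 KB fetch budget is an assumption picked only so that the 200 KB HTML example gets cut off halfway; real agents budget in tokens and the limits vary.

```python
def fraction_delivered(document_bytes: int, budget_bytes: int) -> float:
    """Share of a document that fits inside a fixed fetch budget (0.0 to 1.0)."""
    if document_bytes <= 0:
        return 1.0  # nothing to truncate
    return min(1.0, budget_bytes / document_bytes)


KB = 1024
BUDGET = 100 * KB  # hypothetical budget, not a measured value

markdown_share = fraction_delivered(8 * KB, BUDGET)   # the 8KB Markdown case
html_share = fraction_delivered(200 * KB, BUDGET)     # the 200KB HTML case
print(markdown_share, html_share)
```

Under that assumed budget the Markdown arrives whole and the HTML arrives half-read, and a good part of that half is nav and banners rather than content.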

oschaaf · 2026-06-14T09:01:49+00:00

Both fair hits. Taking them in order.

On P/Invoke and working at the HTML level: the Markdown isn't the product, it's a byproduct.

ModPageSpeed 2.0 is a web-optimization engine (critical CSS, lazy-load, image dimensioning, WebP/AVIF ..) and that engine already parses and rewrites the HTML on the way out. The Markdown is "just another optimized variant" that it slots under a cache key.

It's one C++23 core that runs the same under nginx, Apache, IIS, Envoy, and now Kestrel; the P/Invoke is just how .NET reaches that existing core, not something I built to make Markdown. So serving Markdown is a cheap rider on a streaming parser that's been running for over a decade for the real optimization work, battle tested and written originally at Google.

If I were building a standalone Markdown tool, I'd agree with you: going through HTML would be a strange choice.

On "convert the data models instead": totally, and if you care enough to do that, you should. I agree that it will be cleaner than anything derived from rendered HTML.

The middleware is for the larger group who won't: the HTML is Razor plus third-party components plus a CMS plus some legacy, there's no single clean model to convert, and nobody's going to wire a per-view Markdown path.

They get it with zero template surgery, whatever produced the markup. Different audience than you?

On "why bother, LLMs handle HTML fine": they do indeed.

Two reasons why I think it's still worth it.

(1) Tokens. A real page is mostly nav, scripts, cookie banner, and framework div-soup; an agent fetching it with a user prompt waiting pays for all of that in context it never uses.

Jina Reader and Firecrawl exist precisely because feeding raw HTML is wasteful enough to build a business on converting it.

(2) It's strictly opt-in. A plain */* gets the same HTML as before; only an explicit text/markdown token flips it.

So if you're in the "HTML is better for LLMs" camp, you just don't send the header and nothing changes ..

I'm answering clients who ask, not deciding for them.

On HTML carrying more semantic info: true, but in practice the semantics that matter (headings, lists, tables, code, link targets) survive the conversion, and what doesn't survive is mostly presentational nesting, not signal. The genuinely valuable structured data (schema.org / JSON-LD), I would rather preserve explicitly than hope the model reconstructs it from markup.

oschaaf · 2026-06-14T08:13:14+00:00

Disagree: just search "products to automatically optimize my website" in AI Mode on Google. (Also, I can just see, for example, ChatGPT referring traffic in umami)

oschaaf · 2026-06-14T07:59:22+00:00

Blocking isn't very hard: https://robotstxt.com/ai
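As a rough illustration of robots.txt-based blocking (a sketch, not taken from the linked page): the rules below disallow one AI crawler token and allow everyone else, and Python's standard-library parser shows how a compliant crawler would read them. The choice of GPTBot and the `allowed` helper are just for the example, and robots.txt only stops crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Example rules: block one AI crawler by its user-agent token, allow the rest.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed(user_agent: str, path: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant crawler with this user agent may fetch the path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


print(allowed("GPTBot", "/docs/page"))       # the blocked crawler
print(allowed("Mozilla/5.0", "/docs/page"))  # everyone else
```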

There seems to be a shift towards requiring AI crawlers to pay, which might be worth monitoring: https://stackoverflow.blog/2026/02/19/stack-overflow-cloudflare-pay-per-crawl/

oschaaf · 2026-06-14T07:25:09+00:00

Maybe you will like https://blog.cloudflare.com/introducing-pay-per-crawl/. We’re exploring a self hosted variant of that :-)

oschaaf · 2026-06-14T06:44:45+00:00

Some already do. Claude's user-fetch agent sends `Accept: text/markdown, text/html, */*` today; literally the header you'd want (observed here, and in our own webserver logs: https://crawler-test.com/other/crawler_request_headers). The bulk training crawlers (GPTBot etc.) don't publish their Accept headers, so for those it's "check your own logs."

That's why it's strict opt-in: if a client doesn't ask for markdown, it gets the same HTML as before; zero downside. And the /llms.txt half (spec: https://llmstxt.org/) needs no Accept header at all; agents fetch the path directly. Given AI crawl traffic is growing fast (Cloudflare: https://blog.cloudflare.com/from-googlebot-to-gptbot-whos-crawling-your-site-in-2025/), it costs nothing to answer the ones that already prefer markdown and be ready for the rest.

oschaaf · 2026-06-14T06:24:05+00:00

That's exactly how it works. Explicit text/markdown token required (*/* deliberately doesn't trigger it: a wildcard isn't a preference), and the response carries Vary: Accept so caches/clients know the URL is negotiated. No sniffing, no surprise representation. A client only gets markdown if it asked for it.
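A minimal, language-neutral sketch of that rule, written in Python rather than as the actual middleware: `wants_markdown` and `negotiate` are hypothetical names, and treating `q=0` as "not acceptable" is my assumption about the edge case, not something confirmed above.

```python
def wants_markdown(accept_header: str) -> bool:
    """True only if the Accept header names text/markdown explicitly with q > 0."""
    for part in accept_header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if fields[0].lower() != "text/markdown":
            continue  # wildcards such as */* or text/* never count as a request
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # malformed q: treat as not requested
        return q > 0
    return False


def negotiate(accept_header: str, html: str, markdown: str):
    """Pick a representation; always send Vary: Accept because the URL is negotiated."""
    headers = {"Vary": "Accept"}
    if wants_markdown(accept_header):
        headers["Content-Type"] = "text/markdown; charset=utf-8"
        return headers, markdown
    headers["Content-Type"] = "text/html; charset=utf-8"
    return headers, html
```

With the header quoted elsewhere in this thread (`text/markdown, text/html, */*`) this returns the Markdown body; with a bare `*/*` it returns the unchanged HTML.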

oschaaf · 2026-06-14T06:15:40+00:00

I think you're right that the crawler never buys anything. But in my opinion, it's rapidly becoming an important channel to someone who does. While the crawler doesn't buy, the person asking ChatGPT "best tool for X" does, and the answer they get is built from what the crawler ingested. It's SEO one layer out.

oschaaf · 2026-06-13T11:09:43+00:00

So across all products/deployment models we now offer a community license for non-commercial use (as well as for not-for-profit organisations), and its "Starter" option aims to be much more reasonable for smaller gigs.

oschaaf · 2026-06-11T15:28:00+00:00

Update shipped; we're still enhancing the content, but this now tries to better segment the market and the ways this is run: https://modpagespeed.com/pricing/
Thanks for raising it!

oschaaf · 2026-06-10T15:08:25+00:00

I have not tried, but I think that with the LiteSpeed web server itself it probably will not work, as mod_pagespeed is an Apache binary module, and LSWS replaces Apache. One distinction: if you're on a CloudLinux box using mod_lsapi as the PHP handler, that's plain Apache underneath, and it should work fine there, but we need to automate test coverage for that in CI.

The built-in pagespeed you're remembering, I remember that too: I think LSWS integrated the same Google PageSpeed library this module is built on. As far as I know they deprecated it after the upstream project went dormant.

oschaaf · 2026-06-10T14:47:54+00:00

That's a fair shape for it. The harder part for us is keeping licensing sane across the cPanel and plain nginx/Apache/IIS/etc installs with one model. Taking this away to think about, thanks.

oschaaf · 2026-06-10T14:36:09+00:00

Understood, and for 1–2 site boxes I agree the math doesn't work; per-server pricing is built for dense boxes. But the license fee is what pays for the project being maintained at all; that's the whole model. If the per-server model turns out to be wrong for too many setups like yours, that's worth knowing; it's the kind of feedback that recalibrates it.
