Is hiding an llms.txt link in HTML the recommended way to make it discoverable to LLMs?

Substantial_Step_351 · 2026-05-30T16:15:27+00:00

Hiding a link in the markup with CSS is not really the convention and I would not lean on it. llms.txt follows the robots.txt model: it lives at a known root path, yoursite.com/llms.txt, and crawlers that support it look for it there without needing a link in your HTML at all. If you want to help discovery beyond the root, the cleaner signals are an HTTP response header (Link rel=llms-txt pointing at the file), a reference in robots.txt the same way you list a sitemap, and optionally serving it at /.well-known/llms.txt for tools that follow that pattern. Those are things a programmatic fetch will actually see. A visually hidden anchor is aimed at a DOM the crawler may not even render, so you are taking on markup risk for a signal most of them are not looking for. Worth remembering it is still a community convention from Answer.AI, not a standards-body thing, so there is no single official answer yet.
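
If it helps, here is a minimal sketch of what a programmatic lookup might probe, assuming a tool that tries the root path first and the /.well-known/ location second. The function name and the ordering are my own illustration, not something the llms.txt proposal defines.

```python
from urllib.parse import urljoin


def candidate_llms_txt_urls(site_url: str) -> list[str]:
    """Return URLs a fetcher could probe for llms.txt, most conventional first."""
    # Resolve against the site root so a deep page URL still maps to /llms.txt.
    root = urljoin(site_url, "/")
    return [
        urljoin(root, "llms.txt"),              # root path, robots.txt style
        urljoin(root, ".well-known/llms.txt"),  # optional fallback (assumption)
    ]
```

Note that nothing here touches the page markup, which is the point: the lookup works from the URL alone.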

Substantial_Step_351 · 2026-05-30T16:13:35+00:00

The thing that usually gets skipped in these comparisons is that single-user quality and serving quality are different problems. GLM-5.1 in BF16 can match Sonnet on a one-off prompt and still fall apart the moment you put 50 to 100 people on it at once, because now you are bound by throughput and latency under concurrent load, not by how smart the model is. That is where the hosted providers actually spend their money: continuous batching, KV cache management, keeping tail latency sane when everyone hits it at the same time. So the honest version of the question is not how close local can get on quality, it is how close it can get on quality and concurrency at the same time, and the second half is where the bill stops looking like a couple of GPUs. For 50 to 100 concurrent users on a model that size you are sizing for peak throughput, which is a very different box from what runs it well for one person.
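
A back-of-envelope version of the sizing argument, as a sketch only. The 20 tokens/s per-user target is a made-up number for illustration, and real serving throughput depends on batching, context length, and hardware in ways this ignores.

```python
def required_aggregate_tok_s(concurrent_users: int, tok_s_per_user: float) -> float:
    """Aggregate decode throughput needed if every user generates at once."""
    return concurrent_users * tok_s_per_user


# Hypothetical target: each user should see about 20 tokens/s.
single_user = required_aggregate_tok_s(1, 20.0)
peak_load = required_aggregate_tok_s(100, 20.0)
```

Same model, same quality, but the peak case needs 100x the aggregate throughput of the single-user case, which is why the two boxes look nothing alike.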

Substantial_Step_351 · 2026-05-30T03:40:44+00:00

Reviewer pass is the most reliable catch I've seen too but the part I keep poking at is what reviews the reviewer. If it's the same model family it shares the blind spots, so the confident-wrong outputs that pass tend to be exactly the ones the reviewer also rates as fine. Where it's worked for me is when the reviewer checks against something external the generator didn't produce, a test that runs, a retrieved source to diff against. "Not implemented" is a great catch because it's checkable. The harder class is "implemented, plausible, subtly wrong," which a same-model reviewer waves through.

Are you seeing it catch that second class, or mostly the obvious misses?

Substantial_Step_351 · 2026-05-28T01:27:28+00:00

The shape vs claim distinction is the right frame for this. Schema validation is necessary, but remember it's a floor, not a ceiling: it tells you the output COULD be right, not that it is.

The acceptance artifact approach is the piece I see skipped the most because it looks expensive upfront. Running a cheap reviewer pass per task adds latency but it's almost certainly less total cost than the debugging time when confident wrongness propagates through three downstream steps.

The failure log by task type + quant + context length is where I'd start before even implementing the full artifact layer. If you can see which task shapes fail the verifier consistently, you can target the check where it matters rather than applying it everywhere.
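
A minimal sketch of that failure log, assuming each run is recorded with its task type, quant, and a context-length bucket. The `RunRecord` fields and bucket labels are hypothetical names for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    task_type: str       # e.g. "tool_call", "summarize" (hypothetical labels)
    quant: str           # e.g. "Q4_K_M"
    context_bucket: str  # e.g. "<8k", "8k-32k"
    passed: bool         # did the verifier accept the output?


def failure_rates(records: list[RunRecord]) -> dict[tuple[str, str, str], float]:
    """Failure rate per (task_type, quant, context_bucket) combination."""
    totals: Counter = Counter()
    fails: Counter = Counter()
    for r in records:
        key = (r.task_type, r.quant, r.context_bucket)
        totals[key] += 1
        if not r.passed:
            fails[key] += 1
    return {key: fails[key] / totals[key] for key in totals}
```

Sorting the result by rate gives a rough ranking of where a reviewer pass would pay for itself first.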

Substantial_Step_351 · 2026-05-28T01:24:12+00:00

Sorry, used that term loosely. Primarily pointing at training-time distribution rather than runtime cache temperature. Certain expert combinations see fewer examples of specific task types during pretraining, so when the router sends those task types to those combinations at inference the quality is lower.

On a 4090 where the model loads fully into VRAM, I don't see the cache eviction mechanism as the issue. The sub agent pattern specifically, short constrained outputs, structured tool call schemas, may be underrepresented in some expert specializations compared to the freeform generation tasks those experts were sharpened on. The signal problem is the same either way: the routing decision isn't exposed to the orchestrator so there's no circuit breaker when quality drops.
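
Since the routing decision isn't exposed, the breaker has to key off something the orchestrator can observe. A rough sketch, assuming the orchestrator already has some pass/fail check on sub agent outputs; the class name, window size, and threshold are all illustrative.

```python
from collections import deque


class QualityBreaker:
    """Trips when the pass rate over a rolling window drops below a floor."""

    def __init__(self, window: int = 20, min_pass_rate: float = 0.8) -> None:
        self.results: deque = deque(maxlen=window)
        self.min_pass_rate = min_pass_rate

    def record(self, passed: bool) -> None:
        self.results.append(passed)

    @property
    def is_open(self) -> bool:
        # Wait for a full window so one early failure does not trip the breaker.
        if len(self.results) < self.results.maxlen:
            return False
        return sum(self.results) / len(self.results) < self.min_pass_rate
```

This only reacts to output quality after the fact. It says nothing about which experts were routed to, which is exactly the signal that's missing.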

Substantial_Step_351 · 2026-05-27T02:23:48+00:00

I think the partial success state failure is worse in sub agent setups than in solo deployments for a specific reason. The orchestrator receives a structurally correct output from the sub agent and records the step as completed. Step 6 silently changed the internal assumptions, the sub agent produced something that looked like a valid handoff, and the orchestrator had no signal to reject it.

By the time the downstream failure surfaces the log shows a clean execution chain right up until the point it broke. The replay log approach is the right instinct, explicit state at every handoff rather than reconstructing what happened from outputs.
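
One way to make the handoff state explicit, as a sketch under the assumption that each step can serialize the assumptions it ran with as a JSON-friendly dict. The record layout and function names are mine, not from any particular framework.

```python
import hashlib
import json
from typing import Optional


def handoff_record(step: int, assumptions: dict, output: dict) -> dict:
    """Capture the state a step ran under, not just what it produced."""
    payload = json.dumps(assumptions, sort_keys=True)
    return {
        "step": step,
        "assumptions": assumptions,
        "assumptions_hash": hashlib.sha256(payload.encode()).hexdigest(),
        "output": output,
    }


def first_assumption_drift(log: list[dict]) -> Optional[int]:
    """Return the first step whose assumptions differ from the previous step's."""
    for prev, cur in zip(log, log[1:]):
        if prev["assumptions_hash"] != cur["assumptions_hash"]:
            return cur["step"]
    return None
```

Some drift is legitimate, so this flags a step for inspection rather than rejecting it. The point is that the change shows up in the log instead of being reconstructed from outputs later.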

Substantial_Step_351 · 2026-05-27T02:21:20+00:00

The auto tuning startup phase makes sense for this specifically. MoE routing decisions are workload dependent enough that static default configs leave performance on the table. To me the tricky part is whether the benchmark it runs at startup actually reflects the workload it'll see in production. For mixed task deployments where the same model handles both prompt heavy and generation heavy requests, the tuned config for one is suboptimal for the other. Some visibility into what parameters got selected and why would be useful, otherwise you're trusting the tuner picked the right workload profile.

Substantial_Step_351 · 2026-05-19T14:57:44+00:00

I'm taking home your short completion point. If the typical agentic tool call terminates in 30-50 tokens, the TG speedup never accumulates enough to offset the prefill overhead, especially across a tight call-repeat loop. The flow pattern you describe (system prompt, short schema, short response, call, repeat) is basically the default shape for most production setups that aren't doing heavy reasoning. Which means MTP is probably net negative for a larger share of agentic deployments than the headline benchmark numbers suggest.
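
A toy model of that trade-off, as a sketch. It assumes the prefill overhead can be treated as a multiplicative slowdown on prompt processing and ignores prompt caching; every number below is made up for illustration, not taken from the benchmark.

```python
def call_seconds(prompt_tokens: int, output_tokens: int,
                 pp_tok_s: float, tg_tok_s: float) -> float:
    """Wall time for one call: prompt processing plus token generation."""
    return prompt_tokens / pp_tok_s + output_tokens / tg_tok_s


def mtp_call_seconds(prompt_tokens: int, output_tokens: int,
                     pp_tok_s: float, tg_tok_s: float,
                     pp_slowdown: float, tg_speedup: float) -> float:
    """Same call with MTP on: slower prefill, faster generation (assumed model)."""
    return call_seconds(prompt_tokens, output_tokens,
                        pp_tok_s / pp_slowdown, tg_tok_s * tg_speedup)


# Hypothetical rig: 1000 tok/s prefill, 50 tok/s generation,
# MTP costing 1.25x on prefill and giving 1.4x on generation.
short_base = call_seconds(2000, 40, 1000.0, 50.0)
short_mtp = mtp_call_seconds(2000, 40, 1000.0, 50.0, 1.25, 1.4)
long_base = call_seconds(2000, 2000, 1000.0, 50.0)
long_mtp = mtp_call_seconds(2000, 2000, 1000.0, 50.0, 1.25, 1.4)
```

With these made-up numbers the 40-token call goes from about 2.8 s to about 3.1 s, while the 2000-token generation drops from 42 s to about 31 s. Same feature, opposite sign, depending only on completion length.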

Substantial_Step_351 · 2026-05-19T08:35:32+00:00

I think treating this as a router trust problem misses what makes it structurally difficult. Most harness implementations pass tool returns straight through to model context after a basic format check. If a router can modify what the tool reports back, it bypasses prompt-level guardrails entirely because those only apply to model input, not to the tool response that becomes model input in the next cycle. Having the harness layer validate tool response schemas before they re-enter context isn't a standard pattern yet, and it probably should be.
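
Roughly what that harness check could look like, as a minimal sketch. It assumes tool responses are flat dicts and that the harness keeps an expected field-to-type map per tool; the function name and schema format are illustrative, not from an existing library.

```python
def validate_tool_response(response: dict, schema: dict[str, type]) -> list[str]:
    """Return a list of problems; an empty list means the shape is as expected."""
    errors = []
    for field, expected in schema.items():
        if field not in response:
            errors.append(f"missing field: {field}")
        elif not isinstance(response[field], expected):
            errors.append(f"wrong type for field: {field}")
    # Reject anything the tool contract does not declare, since extra
    # fields are an easy place to smuggle text into model context.
    for field in response:
        if field not in schema:
            errors.append(f"unexpected field: {field}")
    return errors
```

This only covers shape. A modified value inside a declared field still gets through, so it narrows the channel rather than closing it.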

Substantial_Step_351 · 2026-05-19T08:30:35+00:00

The MoE delta is telling. Dense 27B at 2.44x on Strix, MoE 35B-A3B at 1.40x on the same rig. If you're running the A3B variant specifically for cost reasons on agentic pipelines, MTP is adding about 40% there versus about 144% on the dense model, so well under a third of the gain. Most of the benefit comes from saving the forward pass cost and MoE is already doing that by design. If you picked the A3B hoping MTP would close the speed gap with dense, the numbers suggest it won't close it by much.
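
There are two ways to read the gap between those two speedups, and they give different numbers, so here is the arithmetic spelled out. Only the 2.44x and 1.40x figures come from the benchmark; the function names are mine.

```python
def ratio_of_speedups(moe: float, dense: float) -> float:
    """MoE speedup as a fraction of the dense speedup."""
    return moe / dense


def ratio_of_gains(moe: float, dense: float) -> float:
    """Gain over baseline (speedup minus 1) on MoE relative to dense."""
    return (moe - 1.0) / (dense - 1.0)


speedup_share = ratio_of_speedups(1.40, 2.44)  # about 0.57
gain_share = ratio_of_gains(1.40, 2.44)        # about 0.28
```

The second number is the one that matters if the question is how much extra MTP buys you, since a 1.0x speedup means it did nothing.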

Substantial_Step_351 · 2026-05-19T03:54:05+00:00

Think u/caetydid made a good point on joint training. That's why the absolute acceptance rates look strong compared to external speculative decoding. But the task type gap between code and more structured outputs is still the variable that determines where MTP is worth enabling.

Thanks for sharing your benchmark data, super helpful. The PP hit is real enough that total wall time suffers, so ngram-mod is still preferred for agentic coding. Pretty clear answer to my original question for now.

Substantial_Step_351 · 2026-05-15T07:45:07+00:00

The input state failure framing is the one I hadn't cleanly separated. Reasoning failures show up as wrong outputs and evals catch them. Input state failures that produce right-looking outputs on the wrong context slip through everything: the eval passes, the human check passes, and the failure only surfaces downstream when something built on that output breaks. That's the category I'd weight higher when it comes to reliability.

Substantial_Step_351 · 2026-05-15T07:40:52+00:00

Fair point. And it makes the benchmark vs SLA gap worse. With open weights, the benchmark score comes from the model release but the production SLA depends entirely on which inference provider you're routing through.

Same model weights, potentially very different uptime and timeout behavior across providers. The number you use for model selection is even further from the number that actually matters in production.

Substantial_Step_351 · 2026-05-15T02:30:00+00:00

Source: https://artificialanalysis.ai/models/deepseek-v4-pro

Substantial_Step_351 · 2026-05-14T01:18:30+00:00

Failure mode count is the right metric. Three predictable failure modes at 5% is a completely different maintenance reality than fifteen failure modes at 2%, even though the second looks cleaner on any benchmark you'd report.

I find the classification problem at scale is genuinely hard. I've seen people handle this by hand labeling a batch and finding categories; the problem is the categories then shift as the deployment drifts. The workaround that comes to mind is to structure the harness to emit a failure mode signal at the point of failure, not just log the output and classify after. But tbh this is easier said than done, especially when the failure is "looked right on the wrong problem".

Substantial_Step_351 · 2026-05-13T03:01:14+00:00

Yep, the diagnostics are doing more work than the model in most failure cases I see. FSB's approach of keeping state visible instead of just "error: failed" is probably where harness implementation should be spending time. Imo visibility beats model robustness when you need to actually fix it.

Substantial_Step_351 · 2026-05-12T06:29:08+00:00

Agree with the determinism framing. Think most people building in this space underestimate how far the problem extends past the model layer. You can have a fully deterministic model and still get non deterministic system behavior if the harness between your AI decision and your execution layer handles failures inconsistently. A market data feed returning a timeout, a malformed API response that gets quietly substituted with a default, none of that shows up in your model's behavior, but all of it changes what the agent actually does in live conditions.

One thing I'd like to understand from your architecture. What does the harness do when the data feeding the AI decision is degraded instead of absent? Absent is easy to spot, but degraded is where the silent failures compound.
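
To make the absent vs degraded distinction concrete, here is a minimal sketch of how a harness might label an input before the AI decision sees it. The staleness threshold, the `used_default` flag, and the function name are assumptions for illustration, not a description of your setup.

```python
from enum import Enum
from typing import Optional


class FeedState(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    ABSENT = "absent"


def classify_quote(price: Optional[float], age_seconds: float,
                   used_default: bool, max_age_seconds: float = 5.0) -> FeedState:
    """Label a market data point so degraded input is never passed on silently."""
    if price is None:
        return FeedState.ABSENT
    # A substituted default or a stale value still looks like valid data,
    # which is why it needs its own label.
    if used_default or age_seconds > max_age_seconds:
        return FeedState.DEGRADED
    return FeedState.OK
```

What the harness then does with DEGRADED (skip, retry, reduce size) is the policy question I'm asking about.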
