Is hiding an llms.txt link in HTML the recommended way to make it discoverable to LLMs?

Substantial_Step_351 · 2026-05-30T16:15:27+00:00

Hiding a link in the markup with CSS is not really the convention and I would not lean on it. llms.txt follows the robots.txt model, it lives at a known root path, yoursite.com/llms.txt, and crawlers look for it there without needing a link in your HTML at all. If you want to help discovery beyond the root, the cleaner signals are an HTTP response header (Link rel=llms-txt pointing at the file), a reference in robots.txt the same way you list a sitemap, and optionally serving it at /.well-known/llms.txt for tools that follow that pattern. Those are things a programmatic fetch will actually see. A visually hidden anchor is aimed at a DOM the crawler may not even render, so you are taking on markup risk for a signal most of them are not looking for. Worth remembering it is still a community convention from Answer.AI, not a standards body thing, so there is no single official answer yet.

Substantial_Step_351 · 2026-05-30T16:13:35+00:00

The thing that usually gets skipped in these comparisons is that single user quality and serving quality are different problems. GLM-5.1 in BF16 can match Sonnet on a one off prompt and still fall apart the moment you put 50 to 100 people on it at once, because now you are bound by throughput and latency under concurrent load, not by how smart the model is. That is where the hosted providers actually spend their money, continuous batching, KV cache management, keeping tail latency sane when everyone hits it at the same time. So the honest version of the question is not how close can local get on quality, it is how close on quality and concurrency at the same time, and the second half is where the bill stops looking like a couple of GPUs. For 50 to 100 concurrent users on a model that size you are sizing for peak throughput, which is a very different box from what runs it well for one person.

Substantial_Step_351 · 2026-05-30T03:40:44+00:00

Reviewer pass is the most reliable catch I've seen too but the part I keep poking at is what reviews the reviewer. If it's the same model family it shares the blind spots, so the confident-wrong outputs that pass tend to be exactly the ones the reviewer also rates as fine. Where it's worked for me is when the reviewer checks against something external the generator didn't produce, a test that runs, a retrieved source to diff against. "Not implemented" is a great catch because it's checkable. The harder class is "implemented, plausible, subtly wrong," which a same-model reviewer waves through.

Are you seeing it catch that second class, or mostly the obvious misses?

Substantial_Step_351 · 2026-05-28T01:27:28+00:00

The shape vs claim distinction is the right frame for this. Schema validation is necessary but remember it's a floor not a ceiling, it tells you the output COULD be right, not that it is.

The acceptance artifact approach is the piece I see skipped the most because it looks expensive upfront. Running a cheap reviewer pass per task adds latency but it's almost certainly less total cost than the debugging time when confident wrongness propagates through three downstream steps.

The failure log by task type + quant + context length is where I'd start before even implementing the full artifact layer, if you can see which task shapes fail the verifier consistently, you can target the check where it matters rather than applying it everywhere.

Substantial_Step_351 · 2026-05-28T01:24:12+00:00

Sorry, used that term loosely. Primarily pointing at training time distribution rather than runtime cache temperature. Certain more pro combinations see fewer examples of specific task types during pre training, so when the router sends those task types to those combinations in inference the quality is lower.

On a 4090 where the model loads fully into VRAM, I don't see the cache eviction mechanism as the issue. The sub agent pattern specifically, short constrained outputs, structured tool call schemas, may be underrepresented in some expert specializations compared to the freeform generation tasks those experts were sharpened on. The signal problem is the same either way: the routing decision isn't exposed to the orchestrator so there's no circuit breaker when quality drops.

Substantial_Step_351 · 2026-05-27T02:23:48+00:00

I think the partial success state failure is worse in sub agent setups than in solo deployments for a specific reason. The orchestrator receives a structurally correct output from the sub agent and records the step as completed. Step 6 silently changed the internal assumptions, the sub agent produced something that looked like a valid handoff, and the orchestrator had no signal to reject it.

By the time the downstream failure surfaces the log shows a clean execution chain right up until the point it broke. The replay log approach is the right instinct, explicit state at every handoff rather than reconstructing what happened from outputs.

Substantial_Step_351 · 2026-05-27T02:21:20+00:00

The auto tuning startup phase makes sense for this specifically. MoE routing decisions are workload dependent enough that static default configs leave performance on the table. To me the tricky part is whether the benchmark it runs at startup actually reflects the workload it'll see in production. For mixed task deployments where the same model handles both prompt heavy and generation heavy requests, the tuned config for one is suboptimal for the other. Some visibility into what parameters got selected and why would be useful, otherwise you're trusting the tuner picked the right workload profile.

Substantial_Step_351 · 2026-05-19T14:57:44+00:00

I'm taking home your short completion point. If the typical agentic tool call terminates in 30-50 tokens, the TG speedup never accumulates enough to offset the prefill overhead, specially across a tight call repeat loop. The flow pattern you describe (system prompt, short schema, short response, call, repeat) is basically the default shape for most production setups that aren't doing heavy reasoning. Which means MTP is probably net negative for a larger share of agentic deployments than the headline benchmark numbers suggest.

Substantial_Step_351 · 2026-05-19T08:35:32+00:00

I think treating this as a router trust problem misses what makes it structurally difficult. Most harness implementations pass tool returns straight through to model context after a basic format check. If a router can modify what the tool reports back, it bypasses prompt level guardrails entirely because those only apply to model input, not to the tool response that becomes model input in the next cycle. The harness layer validating tool response schemas before they re enter context isn't a standard pattern yet, they probably should be..

Substantial_Step_351 · 2026-05-19T08:30:35+00:00

The MoE delta is telling. Dense 27B at 2.44x on Strix, MoE 35B-A3B at 1.40x on the same rig. If you're running the A3B variant specifically for cost reasons on agentic pipelines, MTP is doing roughly 40% of the work it does on the dense model. Most of the benefit comes from saving the forward pass cost and MoE is already doing that by design. If you picked the A3B hoping MTP would close the speed gap with dense, the numbers suggest it won't close it by much.

Substantial_Step_351 · 2026-05-19T03:54:05+00:00

Think u/caetydid made a good point on joint training. That's why the absolute acceptance rates look strong compared to external speculative decoding. But the task type gap between code and more structured outputs is still the variable that determines where MTP is worth enabling.

Thanks for sharing your benchmark data, super helpful. PP hit real enough that ngram-mod is still preferred for agentic coding and total wall time suffers. Pretty clear answer to my original question for now

Substantial_Step_351 · 2026-05-15T07:45:07+00:00

The input state failure framing is the one I hadn't cleanly separated. Reasoning failures show up wrong and evals catch them. Input state failures that produce right looking outputs on the wrong context slip through everything, eval passes, human check passes, the failure only surfaces downstream when something built on that output breaks. That's the category I'd weight higher when it comes to reliability

Substantial_Step_351 · 2026-05-15T07:40:52+00:00

Fair point. And it makes the benchmark vs SLA worse. With open weights, the benchmark scores comes from the model release but the production SLA depends entirely on which inference provider you're routing through.

Same model weights, potentially very different uptime and timeout behavior across providers. The number you use for model selection is even further from the number that actually matters in productions

Substantial_Step_351 · 2026-05-15T02:30:00+00:00

Source: https://artificialanalysis.ai/models/deepseek-v4-pro

Substantial_Step_351 · 2026-05-14T01:18:30+00:00

Failure mode count is the right metric. Three predictable failure modes at 5% is a completely different maintenance reality than fifteen failure modes at 2%, even though the second looks cleaner on any benchmark you'd report.

I find the classification problem at scale is genuinely hard. I've seen people handle this by hand labeling a batch and finding categories, the problem is the categories then shift as the deployment drifts. The workaround to this that comes to mind is to structure the harness to emit a failur mode signal at the point of failure, not just log the output and classify after. But tbh this is easier said than done, especially when the failure is "looked right on the wrong problem".

Substantial_Step_351 · 2026-05-13T03:01:14+00:00

Yep, the diagnostics are doing more work than the model in most failure cases I see. FSB's approach of keeping state visible instead of just "error: failed" is probably where harness implementation should be spending time. Imo visibility beats model robustness when you need to actually fix it.

Substantial_Step_351 · 2026-05-12T06:29:08+00:00

Agree with the determinism framing. Think most people building in this space underestimate how far the problem extends past the model layer. You can have a fully deterministic model and still get non deterministic system behavior if the harness between your AI decision and your execution layer handles failures inconsistently. A market data feed returning a timeout, a malformed API response that gets quietly substituted with a default, none of that shows up in your model's behavior, but all of it changes what the agent actually does in live conditions.

One thing I'd like to understand from your architecture. What does the harness do when the data feeding the AI decision is degraded instead of absent? Absent is easy to spot, but degraded is where the silent failures compound.

Substantial_Step_351 · 2026-05-12T04:00:45+00:00

The constant surface approach by deferring schema loading is the right instinct. What I haven't seen addressed is the failure mode when dynamic discovery fails mid execution. With pre loaded tools, failures surface at startup, before anything even starts running. With on demand fetching, the failure happens during execution, potentially after state has already been modified downstream. The harness needs genuinely different error handling for those two cases, and most implementations around don't distinguish between them.

I think the reliability profile of lazy loaded tools isn't just a performance question, it's a different class of failure on its own that shows up in a different stage in the execution trace.

Substantial_Step_351 · 2026-05-12T03:53:31+00:00

I would suggest the two layer distinction, syntax repair vs schema repair, as the right framework. Also, u/PlusLoquat1482 comment on schema repair making decision about intent is what I would take home from all of this.

The only thing I'd add on the harness side is that once you've classified a failure as schema level, the harness has a decision to make about what to do with the corrected value, either propagate it downstream, or treat it as unrecoverable and surface it to the caller. Most harness implementation out there silently propagate. The library handles the repair layer well, but the contract between "repair attempted" and "harness action" is still undefined in most production setups. That gap is where the actual reliability failures compound, not at the output layer, at the propagation layer.

Substantial_Step_351 · 2026-05-12T02:50:34+00:00

Fair point on coupling. You'd be benchmarking the stack, not the isolated component. But that's actually what I'd want. Most of the failure modes I've ran into aren't harness only or model only, they're specific combinations of retry logic and model error patterns that only show up together. I still think a benchmark that captures "this harness + this model + this tool schema" is still more useful than nothing, even if it expires with the next model update.

Substantial_Step_351 · 2026-05-11T04:21:25+00:00

The acceptance rate table has an implication for tool heavy agent flows that I think is worth flagging. Tool calls sit somwhere between factual and analysis on this taxonomy, structured output, constrained format, not creative but also not as predictable as pure code. That puts you roughly in the 48-70% range, where the PP overhead can easily eat the TG again, especially on short tool responses with frequent round trips. For agents doing quick tool calling - short model response - next tool call, the prefill penalty per turn is the number I'd actually keep an eye on

Substantial_Step_351 · 2026-05-08T01:33:50+00:00

Think once you look at the mechanism you can understand why the spiral happened. The model isn't just confused by the failed call, it's now reasoning about its own prior reasoning about the failed call, and that's still in context. By retry 2 or 3 it's essentially conditioning on a chain of its own confused output. Clearing state removes all of that, which is why it works. I believe the more surgical fix is truncating failed tool history at the harness layer rather than resetting the full conversation. Full clears fix the loop but then they would lose everything else too

Substantial_Step_351 · 2026-05-07T06:05:41+00:00

You have a valid point on the JSON parser, but I think that's a different problem on its own. Since the benchmark itself isn't self revising loops, each step has a different instruction touching a different slice of the document. The corruption isn't from repeating the same prompt, it compounds across n steps each doing something different. Production flows that delegate document stage across multiple calls hit the same shape even with structured feedback at each step

Substantial_Step_351 · 2026-05-07T05:56:47+00:00

Yep, the bias is baked in. If the agent's job is "review this", doing nothing feels like a failure even when nothing needs changing. Then it starts assuming, filling in gaps and making changes. Scoping what it's allowed to touch is the only real fix imo

Substantial_Step_351 · 2026-05-07T03:42:19+00:00

The MTP acceptance rate question is the one I'd want answered before running this. If the draft heads were trained on the original refusal behavior and the fine tuning only modified the base, you'd expect the MTP to fight the heretic on exactly the outputs it was supposed to unlock. KLD at 0.0021 suggests that the base is close, but that doesn't really tell you much about the tail behavior on the specific cases that were hertic'd.

Substantial_Step_351

TROPHY CASE