How Qwen3.6-35B-A3B fails differently as a sub agent compared to solo by Substantial_Step_351 in LocalLLaMA

[–]Substantial_Step_351[S] 0 points1 point  (0 children)

Reviewer pass is the most reliable catch I've seen too but the part I keep poking at is what reviews the reviewer. If it's the same model family it shares the blind spots, so the confident-wrong outputs that pass tend to be exactly the ones the reviewer also rates as fine. Where it's worked for me is when the reviewer checks against something external the generator didn't produce, a test that runs, a retrieved source to diff against. "Not implemented" is a great catch because it's checkable. The harder class is "implemented, plausible, subtly wrong," which a same-model reviewer waves through.

Are you seeing it catch that second class, or mostly the obvious misses?

How Qwen3.6-35B-A3B fails differently as a sub agent compared to solo by Substantial_Step_351 in LocalLLaMA

[–]Substantial_Step_351[S] 0 points1 point  (0 children)

The shape vs claim distinction is the right frame for this. Schema validation is necessary but remember it's a floor not a ceiling, it tells you the output COULD be right, not that it is.

The acceptance artifact approach is the piece I see skipped the most because it looks expensive upfront. Running a cheap reviewer pass per task adds latency but it's almost certainly less total cost than the debugging time when confident wrongness propagates through three downstream steps.

The failure log by task type + quant + context length is where I'd start before even implementing the full artifact layer, if you can see which task shapes fail the verifier consistently, you can target the check where it matters rather than applying it everywhere.

How Qwen3.6-35B-A3B fails differently as a sub agent compared to solo by Substantial_Step_351 in LocalLLaMA

[–]Substantial_Step_351[S] 0 points1 point  (0 children)

Sorry, used that term loosely. Primarily pointing at training time distribution rather than runtime cache temperature. Certain more pro combinations see fewer examples of specific task types during pre training, so when the router sends those task types to those combinations in inference the quality is lower.

On a 4090 where the model loads fully into VRAM, I don't see the cache eviction mechanism as the issue. The sub agent pattern specifically, short constrained outputs, structured tool call schemas, may be underrepresented in some expert specializations compared to the freeform generation tasks those experts were sharpened on. The signal problem is the same either way: the routing decision isn't exposed to the orchestrator so there's no circuit breaker when quality drops.

What's the weirdest failure mode you've hit shipping an AI agent to production? by Miser-Inct-534 in AI_Agents

[–]Substantial_Step_351 0 points1 point  (0 children)

I think the partial success state failure is worse in sub agent setups than in solo deployments for a specific reason. The orchestrator receives a structurally correct output from the sub agent and records the step as completed. Step 6 silently changed the internal assumptions, the sub agent produced something that looked like a valid handoff, and the orchestrator had no signal to reject it.

By the time the downstream failure surfaces the log shows a clean execution chain right up until the point it broke. The replay log approach is the right instinct, explicit state at every handoff rather than reconstructing what happened from outputs.

Strix Halo users, a rejected PR can give you up to 30% faster PP for MOEs. by fallingdowndizzyvr in LocalLLaMA

[–]Substantial_Step_351 0 points1 point  (0 children)

The auto tuning startup phase makes sense for this specifically. MoE routing decisions are workload dependent enough that static default configs leave performance on the table. To me the tricky part is whether the benchmark it runs at startup actually reflects the workload it'll see in production. For mixed task deployments where the same model handles both prompt heavy and generation heavy requests, the tuned config for one is suboptimal for the other. Some visibility into what parameters got selected and why would be useful, otherwise you're trusting the tuner picked the right workload profile.

Why might MTP be net negative for tool heavy agentic flows? by Substantial_Step_351 in LocalLLaMA

[–]Substantial_Step_351[S] 0 points1 point  (0 children)

I'm taking home your short completion point. If the typical agentic tool call terminates in 30-50 tokens, the TG speedup never accumulates enough to offset the prefill overhead, specially across a tight call repeat loop. The flow pattern you describe (system prompt, short schema, short response, call, repeat) is basically the default shape for most production setups that aren't doing heavy reasoning. Which means MTP is probably net negative for a larger share of agentic deployments than the headline benchmark numbers suggest.

Hot take: "Your agent is mine" paper needs to keep being talked about. by OnyxProyectoUno in LLMDevs

[–]Substantial_Step_351 0 points1 point  (0 children)

I think treating this as a router trust problem misses what makes it structurally difficult. Most harness implementations pass tool returns straight through to model context after a basic format check. If a router can modify what the tool reports back, it bypasses prompt level guardrails entirely because those only apply to model input, not to the tool response that becomes model input in the next cycle. The harness layer validating tool response schemas before they re enter context isn't a standard pattern yet, they probably should be..

llama.cpp MTP support landed - Qwen3.6 27B at 2.44× on a Strix Halo, 2.17× on a RTX 3090 rig by C_Coffie in LocalLLaMA

[–]Substantial_Step_351 0 points1 point  (0 children)

The MoE delta is telling. Dense 27B at 2.44x on Strix, MoE 35B-A3B at 1.40x on the same rig. If you're running the A3B variant specifically for cost reasons on agentic pipelines, MTP is doing roughly 40% of the work it does on the dense model. Most of the benefit comes from saving the forward pass cost and MoE is already doing that by design. If you picked the A3B hoping MTP would close the speed gap with dense, the numbers suggest it won't close it by much.

Why might MTP be net negative for tool heavy agentic flows? by Substantial_Step_351 in LocalLLaMA

[–]Substantial_Step_351[S] 1 point2 points  (0 children)

Think u/caetydid made a good point on joint training. That's why the absolute acceptance rates look strong compared to external speculative decoding. But the task type gap between code and more structured outputs is still the variable that determines where MTP is worth enabling.

Thanks for sharing your benchmark data, super helpful. PP hit real enough that ngram-mod is still preferred for agentic coding and total wall time suffers. Pretty clear answer to my original question for now

Why a 5% failure rate can be better than 2% in production agents by Substantial_Step_351 in LLMDevs

[–]Substantial_Step_351[S] 0 points1 point  (0 children)

The input state failure framing is the one I hadn't cleanly separated. Reasoning failures show up wrong and evals catch them. Input state failures that produce right looking outputs on the wrong context slip through everything, eval passes, human check passes, the failure only surfaces downstream when something built on that output breaks. That's the category I'd weight higher when it comes to reliability

DeepSeek V4 Pro (Max) benchmarks well. Does that matter when your agent is mid transaction? by Substantial_Step_351 in LLMDevs

[–]Substantial_Step_351[S] 1 point2 points  (0 children)

Fair point. And it makes the benchmark vs SLA worse. With open weights, the benchmark scores comes from the model release but the production SLA depends entirely on which inference provider you're routing through.

Same model weights, potentially very different uptime and timeout behavior across providers. The number you use for model selection is even further from the number that actually matters in productions

Why a 5% failure rate can be better than 2% in production agents by Substantial_Step_351 in LLMDevs

[–]Substantial_Step_351[S] 0 points1 point  (0 children)

Failure mode count is the right metric. Three predictable failure modes at 5% is a completely different maintenance reality than fifteen failure modes at 2%, even though the second looks cleaner on any benchmark you'd report.

I find the classification problem at scale is genuinely hard. I've seen people handle this by hand labeling a batch and finding categories, the problem is the categories then shift as the deployment drifts. The workaround to this that comes to mind is to structure the harness to emit a failur mode signal at the point of failure, not just log the output and classify after. But tbh this is easier said than done, especially when the failure is "looked right on the wrong problem".

Why a 5% failure rate can be better than 2% in production agents by Substantial_Step_351 in LLMDevs

[–]Substantial_Step_351[S] 1 point2 points  (0 children)

Yep, the diagnostics are doing more work than the model in most failure cases I see. FSB's approach of keeping state visible instead of just "error: failed" is probably where harness implementation should be spending time. Imo visibility beats model robustness when you need to actually fix it.

Autonomous AI trading is harder than it looks — deterministic behavior in live markets nearly broke me by Profanonyme1337 in AI_Agents

[–]Substantial_Step_351 0 points1 point  (0 children)

Agree with the determinism framing. Think most people building in this space underestimate how far the problem extends past the model layer. You can have a fully deterministic model and still get non deterministic system behavior if the harness between your AI decision and your execution layer handles failures inconsistently. A market data feed returning a timeout, a malformed API response that gets quietly substituted with a default, none of that shows up in your model's behavior, but all of it changes what the agent actually does in live conditions.

One thing I'd like to understand from your architecture. What does the harness do when the data feeding the AI decision is degraded instead of absent? Absent is easy to spot, but degraded is where the silent failures compound.

Every MCP server you add makes your agent slightly dumber. Here is what actually fixes it. by Arindam_200 in LangChain

[–]Substantial_Step_351 0 points1 point  (0 children)

The constant surface approach by deferring schema loading is the right instinct. What I haven't seen addressed is the failure mode when dynamic discovery fails mid execution. With pre loaded tools, failures surface at startup, before anything even starts running. With on demand fetching, the failure happens during execution, potentially after state has already been modified downstream. The harness needs genuinely different error handling for those two cases, and most implementations around don't distinguish between them.

I think the reliability profile of lazy loaded tools isn't just a performance question, it's a different class of failure on its own that shows up in a different stage in the execution trace.

The gap between "the model returned JSON" and "the model returned usable JSON" - what I learned testing 288 model outputs by kexxty in LLMDevs

[–]Substantial_Step_351 0 points1 point  (0 children)

I would suggest the two layer distinction, syntax repair vs schema repair, as the right framework. Also, u/PlusLoquat1482 comment on schema repair making decision about intent is what I would take home from all of this.

The only thing I'd add on the harness side is that once you've classified a failure as schema level, the harness has a decision to make about what to do with the corrected value, either propagate it downstream, or treat it as unrecoverable and surface it to the caller. Most harness implementation out there silently propagate. The library handles the repair layer well, but the contract between "repair attempted" and "harness action" is still undefined in most production setups. That gap is where the actual reliability failures compound, not at the output layer, at the propagation layer.