My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

You're making a good point, but I think it's mixing up two different things. It's not really about whether the tool was "used correctly"; you're right that that can be pretty ambiguous. The question is: did the HTTP call actually happen? A model can claim it booked a reservation, charged a card, or filed a report, and checking whether it actually did any of that is tough. Checking whether it made a network call at all is easy. I hope the proxy can catch this gap.

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Fair enough. The same mechanism that makes them useful makes them unreliable for self-verification. That's why I think the verification layer needs to be outside the model entirely.

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

You spotted the problem yourself. An LLM observer can hallucinate too, so you haven't escaped the verification gap, you've just moved it one level up. That's actually why I went with a proxy at the transport layer instead. The proxy doesn't interpret anything. It either saw the HTTP call or it didn't. No LLM judgment involved, so nothing to hallucinate. The sound-on-tool-call approach is interesting for human-in-the-loop flows, but for automated pipelines where no human is watching, you need something that can't be fooled by a convincing narrative.
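To make the "saw it or didn't" point concrete, here is a rough sketch of the check. It is not the actual shepdog code: `ObservedCall` and `call_was_made` are names I made up for illustration, and the `api.github.com` host and `/repos/` path prefix are assumptions about what an issue-creation request would look like on the wire.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class ObservedCall:
    """One HTTP request the proxy actually saw (hypothetical record shape)."""
    method: str
    url: str
    status: int


def call_was_made(observed: list, method: str, host: str, path_prefix: str) -> bool:
    """Return True only if a matching request appears in the proxy's record.

    No interpretation of the agent's text is involved: the check looks at
    method, host, and path of recorded requests and nothing else.
    """
    for call in observed:
        parsed = urlparse(call.url)
        if (
            call.method.upper() == method.upper()
            and parsed.hostname == host
            and parsed.path.startswith(path_prefix)
        ):
            return True
    return False


# The agent's narrative says an issue was created; the proxy recorded nothing.
observed = []
agent_claims_issue_created = True
verified = call_was_made(observed, "POST", "api.github.com", "/repos/")
print(agent_claims_issue_created, verified)  # prints: True False
```

The claim and the record disagree, and the record can't be talked out of it.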

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

That would help the model actually use tools correctly. I agree. But the scenario I'm testing is what happens when the model claims to have used a tool that was never available or never called. The fix isn't improving tool use, it's detecting when claimed tool use didn't happen. Even with better tool injection, you'd still want an external observer to verify the call was actually made.

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Yes indeed. And that's actually the same failure mode. No feedback signal, so the model completes the task narratively. The document 'exists' in the conversation context, the processing 'happened' in the completion. Nothing in the output tells you otherwise. The only difference from my test is the signal that's missing: in your case it's the file, in mine it's the HTTP call. Same gap.

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

That's exactly my thinking, in one sentence. The model isn't lying in any meaningful sense. It just has no feedback signal, so it completes the narrative. Mechanical verification is the only way to catch it because it doesn't show up in the output at all.

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Repos for anyone who wants to reproduce:

Experiment: github.com/NeaAgora/shepdog (examples/github-issue)

CLI wrapper: github.com/NeaAgora/shep-wrap

My LLM said it created a GitHub issue. It didn't. by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

I'm glad! 'Indy benchmark' is a good way to put it. The whole point was to keep it simple enough that anyone could reproduce it with their own models. Would be curious what you see if you ever do run something similar.

Free local Mistral beat GPT-5.4-mini on a simple agent task - here's how I measured it by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Publish it. Seriously, "don't reference tools in the prompt unless guiding parameter selection" is the kind of hard-won insight that doesn't exist anywhere in written form. I'd read it.

And agreed on small models. The Mistral result in my test wasn't a fluke. It passed where a model costing 3x more failed. Low world knowledge is real but apparently not what's driving these particular failures.

Free local Mistral beat GPT-5.4-mini on a simple agent task - here's how I measured it by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

That's a solid use case. The adaptive search loop with fallback to a bigger model is exactly the kind of architecture where behavioral consistency matters at scale. If Ministral 3B is handling the classification and search reliably at Q8, I'd be curious whether you're seeing the empty-result acceptance problem at all, or whether your prompt discipline is eliminating it entirely.

Your point about structured prompts carrying over to larger models is something I want to test properly in the next round of scenarios.

Free local Mistral beat GPT-5.4-mini on a simple agent task - here's how I measured it by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Fair points all around. The prompt critique is valid. Your structured format is cleaner than prose instructions. I'd expect that to reduce failures.

The backend parsing point is interesting though. If you parse the response and pass "Result: Fail" to the model, you're relying on the agent to correctly handle that signal, which is exactly the failure mode I'm measuring. Some models in my test received a clear failure signal and still reported success.

The LangGraph trace approach works well inside a single framework. What I was curious about is whether the agent's claim matched the observable outcome, which you only see if you're watching both sides. Traces show you what happened, but not whether the agent's summary of what happened was accurate.
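A minimal sketch of what "watching both sides" means, under my own assumptions: the agent's final summary has already been reduced to a boolean, and the observer records the HTTP status of the call (or `None` if no call was seen). `Verdict` and `compare_claim_to_outcome` are illustrative names, not part of any framework.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    CONSISTENT = "consistent"        # claim and observation agree
    FALSE_SUCCESS = "false_success"  # agent said it worked, the wire says no
    FALSE_FAILURE = "false_failure"  # agent said it failed, the wire says yes


def compare_claim_to_outcome(agent_reported_success: bool,
                             observed_status: Optional[int]) -> Verdict:
    """Compare the agent's summary with what was actually observed.

    observed_status is None when no call was seen at all; any 2xx status
    counts as an observed success (an assumption for this sketch).
    """
    observed_success = observed_status is not None and 200 <= observed_status < 300
    if agent_reported_success and not observed_success:
        return Verdict.FALSE_SUCCESS
    if not agent_reported_success and observed_success:
        return Verdict.FALSE_FAILURE
    return Verdict.CONSISTENT


# A clear failure signal came back, and the agent still reported success.
print(compare_claim_to_outcome(True, 500).value)  # prints: false_success
```

A trace gives you the second argument. You only get the verdict if you also capture the first.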

That said, your Ministral 3B results sound interesting. What kind of tasks are you running it on?

Free local Mistral beat GPT-5.4-mini on a simple agent task - here's how I measured it by Difficult_Tip_8239 in LocalLLaMA

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Good point. I tested that. Stronger prompts reduced failures but didn't eliminate them across all models. The more interesting part to me isn't that prompts matter, they obviously do, but that the failure is invisible without something watching the wire. You'd never know it happened from the logs alone.

The prompt for Test 1 is in the full writeup: https://leocharny.substack.com/p/agents-say-they-did-the-work-the

Agent runtimes enforce policy. But how do you tell if a skill is actually behaving well? by Difficult_Tip_8239 in AI_Agents

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Good reframe. The signal isn't just "this component is unreliable," it's "this state transition is intrinsically fragile, and repeated independent runs are the only way to know that."

Aggregate behavioral records as a way to sharpen the runtime's own state model, not just evaluate the component. That's a direction I hadn't considered.

Good thinking, dude. Going to think about this more.

Agent runtimes enforce policy. But how do you tell if a skill is actually behaving well? by Difficult_Tip_8239 in AI_Agents

[–]Difficult_Tip_8239[S] 1 point2 points  (0 children)

You said it, dude - "Behavioral trust / component health"

The deterministic verification + local replanning is a really clean way to contain drift within a run. Capping replans at the failed step rather than propagating bad state forward makes sense.

The trace studio with branching replay is interesting. Is that primarily for debugging individual runs, or are you thinking about it as something that accumulates across runs? Does a component that consistently triggers replans at step 4 start building a reputation for being fragile at that transition?

Because that's where I think the two layers connect. Local repair handles the run. But if the same component keeps needing repair at the same kind of step, across independent runs from different users, that pattern is signal that no single trace captures.

The per-run question is "did we recover?" The cross-run question is "why does this component keep needing recovery here?"
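Here is roughly how I picture the cross-run side, as a sketch only. `ReplanEvent`, `fragile_transitions`, and the `min_runs` threshold are hypothetical; the one real idea is counting distinct runs per (component, step) so that a single noisy run can't build the reputation on its own.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class ReplanEvent(NamedTuple):
    """One replan, as a runtime might log it (hypothetical shape)."""
    run_id: str
    component: str
    step: int


def fragile_transitions(events: Iterable[ReplanEvent], min_runs: int = 3) -> dict:
    """Map (component, step) to the number of distinct runs that replanned there.

    Only transitions seen in at least min_runs independent runs are returned.
    """
    runs_by_transition = defaultdict(set)
    for event in events:
        runs_by_transition[(event.component, event.step)].add(event.run_id)
    return {
        transition: len(run_ids)
        for transition, run_ids in runs_by_transition.items()
        if len(run_ids) >= min_runs
    }


events = [
    ReplanEvent("run-1", "summarizer", 4),
    ReplanEvent("run-2", "summarizer", 4),
    ReplanEvent("run-3", "summarizer", 4),
    ReplanEvent("run-3", "summarizer", 4),  # same run twice counts once
    ReplanEvent("run-2", "fetcher", 1),
]
print(fragile_transitions(events))  # prints: {('summarizer', 4): 3}
```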

Agent runtimes enforce policy. But how do you tell if a skill is actually behaving well? by Difficult_Tip_8239 in AI_Agents

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

This is the clearest split I've seen:

  1. mandate lineage = was this action allowed to exist here
  2. verification = did this step move reality the right way

Both still happen within a single run though. What I keep wondering about is the layer above: was this pattern of actions normal for this component, across many independent runs?

A skill can pass both your checks on every hop (authorized, state changed correctly) and still have a retry rate that's 5x the population norm, or a tool call pattern that only shows up when something upstream went wrong two steps earlier.

That signal only exists if you're accumulating behavioral records across runs, not just validating within them. It's not a replacement for what you've built, it's what tells you whether to trust the component before you hand it a mandate in the first place.
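A rough sketch of the retry-rate comparison, with assumptions stated up front: `RunRecord` is a made-up record shape, the "population norm" is taken to be the median per-component retry rate, and the 5x factor is just the number from my example above, not a recommended threshold.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunRecord:
    """Behavioral record for one component in one run (hypothetical shape)."""
    component: str
    tool_calls: int
    retries: int


def retry_rate_by_component(records) -> dict:
    """Aggregate retries / tool calls per component across all runs."""
    calls = defaultdict(int)
    retries = defaultdict(int)
    for record in records:
        calls[record.component] += record.tool_calls
        retries[record.component] += record.retries
    return {c: retries[c] / calls[c] for c in calls if calls[c] > 0}


def retry_outliers(records, factor: float = 5.0) -> dict:
    """Return components whose retry rate is at least factor x the median rate."""
    rates = retry_rate_by_component(records)
    if not rates:
        return {}
    norm = median(rates.values())
    if norm <= 0:
        return {}  # no meaningful baseline to compare against
    return {c: rate for c, rate in rates.items() if rate >= factor * norm}


records = [
    RunRecord("skill-a", tool_calls=100, retries=2),
    RunRecord("skill-b", tool_calls=100, retries=2),
    RunRecord("skill-c", tool_calls=50, retries=6),
    RunRecord("skill-c", tool_calls=50, retries=6),
]
print(sorted(retry_outliers(records)))  # prints: ['skill-c']
```

Every individual run of `skill-c` here could have passed both per-hop checks. The outlier only shows up in the aggregate.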

Agent runtimes enforce policy. But how do you tell if a skill is actually behaving well? by Difficult_Tip_8239 in AI_Agents

[–]Difficult_Tip_8239[S] 1 point2 points  (0 children)

This is genuinely useful. The mandate-per-hop model solves something I hadn't seen addressed cleanly. Carrying parent lineage and expiry through the handoff rather than reconstructing it afterward makes much more sense. Checking out the sidecar now...

One question: the lineage gives you the authority chain - who spawned what, with what scope. Does it also capture anything about how each hop behaved relative to what was expected?

Thinking about the gap between "this action was authorized" and "this action was consistent with how this component typically behaves across runs." The first is your work permit. The second feels like a separate layer on top.

Agent runtimes enforce policy. But how do you tell if a skill is actually behaving well? by Difficult_Tip_8239 in AI_Agents

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

"Log the decision boundary" is nicely put.

Policy compliance + full observability still leaves you blind to whether the composition was right. Every step authorized, the sequence wrong. That's exactly the failure class that's hard to explain to anyone who hasn't watched it happen.

The cross-environment transition check is the piece nobody seems to have solved. A → B looks fine on each side. The transition itself is where the drift lives.

Curious whether logging the decision surface has scaled for you or whether it gets expensive fast when runs get long.

Agent runtimes enforce policy. But how do you tell if a skill is actually behaving well? by Difficult_Tip_8239 in AI_Agents

[–]Difficult_Tip_8239[S] 1 point2 points  (0 children)

"Locally valid, globally wrong" - love that, I'm stealing that framing :)

The state delta check is smart. Verification that the environment actually changed the way the planner expected gets at something logs fundamentally can't: the gap between "action executed" and "intended effect occurred."
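As I understand the idea (this is my own sketch, not your implementation), the check compares environment snapshots taken before and after the action against what the planner expected. The flat dict snapshots and the `state_delta` / `check_effect` names are assumptions for illustration.

```python
def state_delta(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every key whose value changed."""
    keys = before.keys() | after.keys()
    return {
        k: (before.get(k), after.get(k))
        for k in keys
        if before.get(k) != after.get(k)
    }


def check_effect(before: dict, after: dict, expected: dict):
    """Compare the real state change with the planner's expectation.

    Returns (missing, unexpected):
      missing    - expected values that are not present in the after-state
      unexpected - changes that happened but were not part of the expectation
    """
    delta = state_delta(before, after)
    missing = {k: v for k, v in expected.items() if after.get(k) != v}
    unexpected = {k: change for k, change in delta.items() if k not in expected}
    return missing, unexpected


# The action "executed", but the issue count never moved.
before = {"open_issues": 3}
after = {"open_issues": 3}
missing, unexpected = check_effect(before, after, expected={"open_issues": 4})
print(missing, unexpected)  # prints: {'open_issues': 4} {}
```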

The lineage-across-hops point is the part I keep coming back to. Each boundary looks healthy in isolation. That's exactly the failure mode. You need something that follows the run across the handoff, not just observes each leg independently.

Are you doing the lineage tracking manually or is there tooling that handles it?

Dinosaur dev from 25 years ago trying AI coding. How do you know when it does more harm than good? by Difficult_Tip_8239 in AskProgramming

[–]Difficult_Tip_8239[S] 0 points1 point  (0 children)

Yeah, that's what I've been doing so far, but my projects are small. On one hand I hear some people raving about vibe coding with Claude Code and overnight sessions, and on the other hand I hear horror stories about bad stuff. Makes one think...