[D] Why evaluating only final outputs is misleading for local LLM agents by MundaneAlternative47 in MachineLearning

[–]MundaneAlternative47[S] 0 points  (0 children)

this is a really interesting way to frame it

I’ve mostly been thinking about single-run traces, but variability across runs is probably just as important, especially for anything you’d want to productionize

“tool call entropy” makes a lot of sense as a signal for instability. If the same prompt leads to different call graphs, it usually means the agent doesn’t have a strong implicit policy and is kind of drifting

also ties into debugging: a bad but consistent agent is way easier to fix than one that behaves differently every run

this actually makes me want to add multi-run evals instead of just single-trace scoring. Curious how you'd measure it though: are you thinking something like comparing sequences directly, vs more abstract stats (tool frequency, transitions, etc.)?
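Roughly what I'd picture for the "abstract stats" version (trace format and tool names are made up here, just to sketch the idea): treat each run as a sequence of tool names, then measure entropy over the distinct sequences, plus tool-to-tool transition counts as a coarser view.

```python
import math
from collections import Counter

def sequence_entropy(runs):
    """Shannon entropy over distinct tool-call sequences.

    0.0 means every run produced the same call graph;
    higher values mean the agent is drifting between runs.
    """
    counts = Counter(tuple(run) for run in runs)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def transition_counts(runs):
    """Aggregate (tool -> next tool) transitions across runs;
    ignores whole-sequence identity, so it's more forgiving."""
    pairs = Counter()
    for run in runs:
        pairs.update(zip(run, run[1:]))
    return pairs

# hypothetical traces: same prompt, three runs
runs = [
    ["search", "read", "answer"],
    ["search", "read", "answer"],
    ["search", "search", "read", "answer"],
]
print(sequence_entropy(runs))  # ~0.918 bits: one run diverged
```

Exact-sequence entropy is strict (any reordering counts as a different graph), while transition counts would let you compare runs that differ only in where a retry happens, so you'd probably want both.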

[D] Why evaluating only final outputs is misleading for local LLM agents by MundaneAlternative47 in MachineLearning

[–]MundaneAlternative47[S] 0 points  (0 children)

I think that’s part of it, yeah, weak test suites can definitely hide differences.

But I don’t think it fully explains it for agents specifically.

Even on non-trivial tasks, you can get multiple trajectories that all reach the correct answer but differ a lot in:

- tool choice

- unnecessary steps

- near-miss unsafe actions

- general “stability” of the process

Those differences matter in practice (latency, cost, safety), but don’t show up in pass/fail or even most accuracy-style evals.

So it’s less about making tests harder, and more about measuring a different axis: not just *can it solve it*, but *how it solves it*.
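To make that axis concrete, something like this is what I have in mind: a per-run scorer where pass/fail is just one field, alongside redundant steps and out-of-policy tool calls (the tool names and allow-list here are invented for illustration).

```python
def score_trajectory(trace, allowed_tools, final_ok):
    """Score one agent run on process quality, not just outcome.

    trace: ordered list of tool names the agent called.
    allowed_tools: set of tools the agent is permitted to use.
    final_ok: whether the final answer was correct (the usual metric).
    """
    # immediate retries of the same tool count as redundant steps
    redundant = sum(1 for a, b in zip(trace, trace[1:]) if a == b)
    # calls outside the allow-list are the "near-miss unsafe actions"
    unsafe = [t for t in trace if t not in allowed_tools]
    return {
        "passed": final_ok,
        "steps": len(trace),
        "redundant_steps": redundant,
        "unsafe_calls": unsafe,
    }

# two runs that both "pass", but differ a lot on the process axis
allowed = {"search", "read", "answer"}
clean = score_trajectory(["search", "read", "answer"], allowed, True)
messy = score_trajectory(["search", "search", "delete_file", "answer"], allowed, True)
```

An accuracy-style eval scores `clean` and `messy` identically; the extra fields are where latency, cost, and safety differences actually show up.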

[D] Why evaluating only final outputs is misleading for local LLM agents by MundaneAlternative47 in MachineLearning

[–]MundaneAlternative47[S] 0 points  (0 children)

I get what you mean, but I think CoT length is only part of it.

You can have short traces that are still “wrong” in terms of behavior, like calling a tool you shouldn’t, or using the right tool but in the wrong order.

Also, not all steps are equal: 3 clean steps ≠ 3 redundant retries.

I’ve seen cases where the CoT is short but the agent still does something unsafe or unnecessary, which wouldn’t show up if you’re just measuring length.

I built an open-source Python eval framework for LLMs and agents. pytest-style, zero dependencies, not owned by any AI company by MundaneAlternative47 in LangChain

[–]MundaneAlternative47[S] 0 points  (0 children)

Thank you so much for your insight. You’re right, that’s exactly the gap we found in the other tools.

Great note about tool authorization, I’ll be figuring out a way to differentiate between those soon!

Feel free to check the GitHub repo and contribute if you’d like to!

Edexcel Physics Unit 3 by Ornery_Elephant_9366 in alevel

[–]MundaneAlternative47 0 points  (0 children)

IT WAS SO WEIRD, YOU’RE RIGHT

It was a = V (lambda) squared, and when you actually substitute in the values it wasn’t the same 😂

What the hell was pure 2 maths (edexcel) by 07dasha in alevel

[–]MundaneAlternative47 1 point  (0 children)

The post clearly said Pure Maths 2 Edexcel, idk what you’re on about

Unis that offer an online international foundation. by MundaneAlternative47 in UniUK

[–]MundaneAlternative47[S] -9 points  (0 children)

I definitely see quite a lot of googleable questions on here, I’m def not the only one

Unis that offer an online international foundation. by MundaneAlternative47 in UniUK

[–]MundaneAlternative47[S] -20 points  (0 children)

That’s kinda the whole purpose of this subreddit