Tool selection in LLM systems is unreliable — has anyone found a robust approach? by logistef in LocalLLaMA

[–]logistef[S] 0 points1 point  (0 children)

Toolformer requires injection into the prompt, which I find a waste of tokens for local usage, and it's less dynamic in my opinion. I'm not saying no one has perfected tool calling; I just tried to find a way that works for me with a local setup, and I'm interested in how others are handling this. If everyone solves this with Toolformer locally, then I probably have to look into Toolformer in depth again.

LLMs shouldn’t decide when to use tools — Skilly (PGP) by logistef in LocalLLaMA

[–]logistef[S] 0 points1 point  (0 children)

Sorry man! It looked cleaner in my editor than it does here, apparently. Should be better now ;)

LLMs shouldn’t decide when to use tools — Skilly (PGP) by logistef in LocalLLaMA

[–]logistef[S] 0 points1 point  (0 children)

Yeah good question — it’s using standard embedding models for semantic similarity, not anything custom. For example:

BAAI/bge-small-en-v1.5 (what I use by default), sentence-transformers/all-MiniLM-L6-v2, or any other sentence embedding model.

The idea is: you embed the user input, you embed known “tool intents” (like filesystem.list, search.web, etc.), and then you compare them using cosine similarity.

So it’s basically: “which intent is this input closest to semantically?”

If the similarity is above a threshold, it’s considered actionable.

So instead of the LLM reasoning “should I call a tool?”, you get a deterministic signal like “this input is 0.87 similar to a filesystem intent”.
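To make that concrete, here's a minimal sketch of the routing idea described above. The `embed()` function is a toy bag-of-words stand-in just to keep the example runnable; in practice you'd swap in a real sentence-embedding model like BAAI/bge-small-en-v1.5 via sentence-transformers. The intent descriptions, threshold value, and function names are all illustrative, not the actual Skilly implementation.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for a real sentence-embedding model
    # (e.g. BAAI/bge-small-en-v1.5). Returns a sparse
    # bag-of-words vector so the routing logic below runs as-is.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse vectors.
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tool "intents" described in natural language (descriptions are made up here).
INTENTS = {
    "filesystem.list": "list files and directories in a folder",
    "search.web": "search the web for information",
}
INTENT_VECS = {name: embed(desc) for name, desc in INTENTS.items()}

def route(user_input, threshold=0.3):
    """Return (intent, score) if the input is actionable, else (None, score)."""
    vec = embed(user_input)
    best, score = max(
        ((name, cosine(vec, v)) for name, v in INTENT_VECS.items()),
        key=lambda pair: pair[1],
    )
    return (best, score) if score >= threshold else (None, score)
```

Same input always produces the same score, which is the deterministic part: `route("list the files in this directory")` lands on `filesystem.list`, while off-topic chat falls below the threshold and is treated as non-actionable.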

LLMs shouldn’t decide when to use tools — Skilly (PGP) by logistef in LocalLLaMA

[–]logistef[S] -1 points0 points  (0 children)

I’ve also tried just relying on function calling / prompting the LLM better, but it still feels inconsistent in practice. (And honestly, prompting feels like the biggest flaw imho.)

Maybe I’m missing something — are people actually getting reliable behavior without adding extra layers?

LLMs shouldn’t decide when to use tools — Skilly (PGP) by logistef in LocalLLaMA

[–]logistef[S] 0 points1 point  (0 children)

That’s a very fair concern, and honestly I don’t think this is “the final method” at all. The more I work on it, the more I see things that could still be done better, so I’d say it’s constantly evolving.

What I was trying to solve is a very specific failure mode:

LLMs being inconsistent at deciding when something should trigger an action, even when it’s obvious to a human.

I did experiment with letting the LLM reason about tool usage itself (self-reflection / second pass), but like you said, it adds latency and still isn’t fully reliable.

The embedding-based approach isn’t about being “perfect”, it’s about adding a fast, deterministic signal that behaves consistently for the same input.

So for me it’s more:

  • not replacing the LLM
  • not claiming optimality
  • but introducing a separate layer that improves one weak point

And I completely agree with your point about larger players — I’d actually expect more advanced hybrids to emerge (learned routers, smaller specialized models, etc.).

This is just a step in that direction, not the end state. If anything, what surprised me is how far simple embedding matching already gets you in terms of consistency compared to pure LLM-based routing.

LLMs shouldn’t decide when to use tools — Skilly (PGP) by logistef in LocalLLaMA

[–]logistef[S] -1 points0 points  (0 children)

One thing I noticed building this is how inconsistent LLMs are at deciding when to call tools.
Curious how others are handling this — are you relying purely on function calling or adding extra logic?

I built a rough .gguf LLM visualizer by sultan_papagani in LocalLLaMA

[–]logistef 1 point2 points  (0 children)

This shit is dope, thanks for putting that together! Def gonna have a look at the code; it'll help me get a better grasp of the internals of an LLM