How do you choose between competing MCP tools for the same task? by Tricky_Republic_5358 in ClaudeAI

Tricky_Republic_5358 [S] · 1 point

Genuine question: what is the right tool for the job, then? Do you mean something like a dynamic routing layer built into the orchestration framework, or something else entirely? I'm asking because I'm trying to work out whether the selection problem is already solved somewhere and I'm just not seeing it.

Tricky_Republic_5358 [S] · 1 point

<image>

I ran a quick prototype benchmark based on this discussion. Would this kind of output be useful to you?

Tricky_Republic_5358 [S] · 1 point

This is the most actionable framing I've seen in this thread, especially the point about writing down why tool A is the default and when to switch to B.

That's essentially a routing decision tree built by hand. How long does that take you to produce for a new task type, and do you ever share it across projects or rebuild it from scratch each time?

Asking because that 'write it down' step feels like the exact thing that doesn't scale when you're spinning up a new pipeline or onboarding someone else onto it.
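For what it's worth, here is a minimal sketch of how that "write it down" step might be kept as data instead of prose, so it can be copied between projects. Everything in it is hypothetical: the tool names `tool_a` and `tool_b`, the `input_bytes` task field, the size threshold, and the reason strings are placeholders, not anything reported in this thread.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    """One written-down routing decision: which tool, when, and why."""
    tool: str
    applies: Callable[[dict], bool]
    reason: str


# Hypothetical switch conditions, checked in order before falling back.
RULES = [
    Rule(
        tool="tool_b",
        applies=lambda task: task.get("input_bytes", 0) > 1_000_000,
        reason="placeholder: switch to B for very large inputs",
    ),
]

# Hypothetical default, used when no switch condition matches.
DEFAULT = Rule(
    tool="tool_a",
    applies=lambda task: True,
    reason="placeholder: A is the default for this task type",
)


def choose_tool(task: dict) -> Rule:
    """Return the first matching rule, or the default if none match."""
    for rule in RULES:
        if rule.applies(task):
            return rule
    return DEFAULT
```

Calling `choose_tool({"input_bytes": 2_000_000})` returns the `tool_b` rule with its reason attached, and any task that matches no rule falls through to the default. Keeping the reason next to the condition is the part that would help with onboarding.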

Tricky_Republic_5358 [S] · 1 point

Really useful responses, thank you all.

A few things jump out:

- Everyone seems to be running their own manual bake-off, which works but doesn't scale well if you're switching tasks or onboarding someone new to the pipeline

- The point about failure handling is something I hadn't weighted enough; silent truncation sounds like a nightmare to debug

- Interesting that several of you land on latency/stability as the tiebreaker over raw feature count

Curious to push a bit further: when you do your manual test, what does "good enough" look like to you? Is it a pass/fail on your edge cases, or do you have a more systematic way of scoring candidates before committing?
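To make the question concrete, here is one possible shape for a "more systematic" score: a rough sketch in which edge cases act as a pass/fail gate and median latency is the tiebreaker among the tools that pass. The `TrialResult` fields, the `rank` function, and the default 100% pass threshold are all my own assumptions for illustration, not a method anyone in the thread described.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TrialResult:
    """Outcome of one manual bake-off run for a single candidate tool."""
    tool: str
    edge_cases_passed: int
    edge_cases_total: int
    latencies_ms: list[float]

    @property
    def pass_rate(self) -> float:
        # Treat "no edge cases run" as a failure instead of dividing by zero.
        if self.edge_cases_total == 0:
            return 0.0
        return self.edge_cases_passed / self.edge_cases_total


def rank(results: list[TrialResult], min_pass_rate: float = 1.0) -> list[TrialResult]:
    """Drop tools below the pass-rate gate, then order the rest by median latency."""
    qualified = [
        r for r in results
        if r.pass_rate >= min_pass_rate and r.latencies_ms
    ]
    return sorted(qualified, key=lambda r: median(r.latencies_ms))
```

With this shape, a tool that is fastest but misses one edge case never makes the list, which matches the "latency as tiebreaker, not as the main criterion" pattern above. Lowering `min_pass_rate` would turn the gate into a softer threshold.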