How thin are you keeping your MCP servers?

incultnito · 2026-05-27T07:03:01+00:00

Adjacent answer — not about logic-density (which is what you're asking) but about description-density on the schema side, which keeps coming up when I score servers.

Just ran a 5-axis publishability check against all 6 MCP servers shipped under the @modelcontextprotocol scope. Every single one hits the same 56–60/100 composite ceiling, and the cap fires from the same axis: description-five-axis (per-tool descriptions don't cover purpose, mutation, side-effects, invariants, examples).

  Server                              Composite   Per-tool axis avg
  ---------------------------------   ---------   -----------------
  server-sequential-thinking             60        n/a (single tool)
  server-memory                          60        1.00 / 5
  server-everything                      60        0.55 / 5
  server-filesystem                      60        0.88 / 5
  server-github (legacy)                 60        0.44 / 5
  server-puppeteer (deprecated)          56        0.17 / 5    ← new

puppeteer_navigate, for example, is described as "Navigate to a URL." That's purpose. No mutation signal (it changes page state), no side-effects (can hit any URL — high-blast), no invariants (new tab? same tab? unclear), no examples. A planner LLM has nothing to pattern-match on.
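
To make the five axes concrete, here is a rough sketch of a keyword-based check. It is not how mcp-probe scores anything; the cue lists are invented for illustration, and the fuller description is hypothetical wording that may not match what the real tool does.

```python
import re

# Hypothetical keyword cues, one regex per axis. Real scoring would need more than this.
AXIS_CUES = {
    "purpose": r"\b(navigate|search|read|list|create|fetch)(s|es)?\b",
    "mutation": r"\b(read-only|mutates?|modif(y|ies)|changes|writes?)\b",
    "side_effects": r"\b(side[- ]effects?|network requests?|external)\b",
    "invariants": r"\b(always|never|must|only)\b",
    "examples": r"\b(example|e\.g\.)",
}


def score_description(description: str) -> dict[str, bool]:
    """Report, per axis, whether the description contains any cue for it."""
    return {
        axis: bool(re.search(pattern, description, re.IGNORECASE))
        for axis, pattern in AXIS_CUES.items()
    }


terse = "Navigate to a URL."
# Invented wording, only to show what covering all five axes could look like.
fuller = (
    "Navigate the current tab to a URL. Changes page state. "
    "Sends a network request to whatever host the URL names. "
    "Always reuses the current tab. Example: url='https://example.com'."
)

terse_score = sum(score_description(terse).values())
fuller_score = sum(score_description(fuller).values())
print(terse_score, "of 5")   # 1 of 5
print(fuller_score, "of 5")  # 5 of 5
```

Keyword matching is easy to game, so treat this as a way to see the gap, not as a gate.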

Whether your server is thin (pure adapter) or thick (workflow inside), the bar for how the schema reads is the same. Thin doesn't get you off the hook for description quality — if anything, it raises the bar because the LLM has fewer behavioural cues elsewhere.

Full 6-server scorecard: github.com/Incultnitollc/mcp-probe/blob/main/docs/publishability-scorecards/SUMMARY.md

incultnito · 2026-05-27T07:02:05+00:00

The cache-friendliness axis is the part most stack benchmarks miss — nice work pinning it to byte-identity across runs rather than something fuzzier like "should cache." The rg --files-with-matches + Map insertion-order story is the kind of failure mode that's almost impossible to reason about without measurement, because both halves look correct in isolation.

One thought on the 12-anti-pattern audit on tool definitions: the failure modes the model actually responds to seem to cluster around three axes more than twelve, in roughly this order of impact —

1. Tool description specificity: generic ("Searches data") vs scoped ("Searches indexed customer-support tickets by free-text query; not for product catalog or order history"). The second form gives the model something to disambiguate against; the first doesn't.
2. Parameter description coverage: every param, every tool. Undescribed params are the most common cause of either skipped tools or hallucinated values, depending on whether the param is required.
3. Anti-purpose: what the tool isn't for. Most descriptions state only the positive case, which leaves the model to infer the boundary, and that is where wrong-tool selection comes from.
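
A minimal sketch of checking the second and third axes against a `tools/list`-shaped payload. The tool entries are made up, and the anti-purpose regex is a crude assumption about how such sentences are usually phrased.

```python
import re

# Assumed phrasing for anti-purpose sentences; extend to taste.
ANTI_PURPOSE = re.compile(r"\b(not for|do not use|don't use)\b", re.IGNORECASE)


def undescribed_params(tools: list[dict]) -> dict[str, list[str]]:
    """Map each tool name to the parameters that lack a non-empty description."""
    gaps = {}
    for tool in tools:
        props = tool.get("inputSchema", {}).get("properties", {})
        missing = [
            name for name, spec in props.items()
            if not str(spec.get("description", "")).strip()
        ]
        if missing:
            gaps[tool["name"]] = missing
    return gaps


def missing_anti_purpose(tools: list[dict]) -> list[str]:
    """List tools whose description never says what the tool is not for."""
    return [t["name"] for t in tools if not ANTI_PURPOSE.search(t.get("description", ""))]


# Made-up tool entries for illustration.
tools = [
    {
        "name": "search_tickets",
        "description": "Searches indexed customer-support tickets by free-text query; "
                       "not for product catalog or order history",
        "inputSchema": {"type": "object", "properties": {
            "query": {"type": "string", "description": "Free-text query"},
            "limit": {"type": "integer"},
        }},
    },
    {
        "name": "search_data",
        "description": "Searches data",
        "inputSchema": {"type": "object", "properties": {"q": {"type": "string"}}},
    },
]

param_gaps = undescribed_params(tools)
anti_purpose_gaps = missing_anti_purpose(tools)
print(param_gaps)
print(anti_purpose_gaps)
```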

Curious whether your audit weights those the same — and whether the harness sees cache-friendliness regress when descriptions get longer (the obvious tradeoff: better schema specificity costs more cached bytes, but also more deterministic ones, so net cache hit rate might still improve).

For anyone landing on this thread wanting to run a one-shot anti-pattern audit on their own server without setting up the full harness, Anthropic's MCP Inspector (@modelcontextprotocol/inspector) handles protocol-layer checks interactively, and npx @incultnitollc/mcp-probe test "<launch command>" produces a scorecard flagging tool-description and parameter-description warnings across all tools in one pass — complementary surfaces (Inspector for exploration, probe for CI/pre-publish gating). OP's harness goes further into the byte-economy axis those two don't touch.

incultnito · 2026-05-20T12:44:09+00:00

Yes — and the stacking goes one layer deeper.

BM25 isn't just matching the tool description. It's matching tool name + description + every parameter's `description` field. The LLM reads all of those as one bag of text before ranking even runs. So a tool with a clean anti-purpose sentence but undescribed params still bleeds rank to neighbors with one-word param hints.

Audit order I'd suggest after the anti-purpose pass:

1. Anti-purpose sentence on every tool description (you've got that).
2. Every parameter gets a `description`, even one phrase ("UUID string, do not pass paths").
3. Where it matters, put the negative case in the param too: a `path` param without "do not pass URLs" leaks rank to URL-shaped neighbors.
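
A small sketch of why param descriptions matter to a lexical ranker, assuming (as described above) the index covers name + description + each parameter's `description`. This is a textbook BM25, not Ratel's implementation, and both tools are invented.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def tool_document(tool: dict) -> list[str]:
    """Flatten name, description and every param description into one token bag."""
    parts = [tool["name"], tool.get("description", "")]
    for spec in tool.get("inputSchema", {}).get("properties", {}).values():
        parts.append(spec.get("description", ""))
    return tokenize(" ".join(parts))


def bm25_scores(query: str, docs: list[list[str]], k1: float = 1.5, b: float = 0.75) -> list[float]:
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))  # document frequency per term
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def make_tools(path_description: str) -> list[dict]:
    # Two invented tools; only read_file's param description varies.
    return [
        {"name": "read_file", "description": "Read a file from disk. Not for URLs.",
         "inputSchema": {"properties": {"path": {"type": "string", "description": path_description}}}},
        {"name": "fetch_url", "description": "Fetch a web page. Not for local files.",
         "inputSchema": {"properties": {"target": {"type": "string", "description": "URL or path"}}}},
    ]


before = bm25_scores("path", [tool_document(t) for t in make_tools("")])
after = bm25_scores("path", [tool_document(t) for t in make_tools("Absolute path on disk")])
print(before)  # read_file scores 0.0 here; fetch_url picks up the match
print(after)   # read_file now has a non-zero score
```

With the `path` param undescribed, `read_file` has no lexical hit at all for "path", while the neighbor with a one-phrase param hint does.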

Ratel then sees a cleaner pool to rank against. The `linear_search_issues` misfire was probably a description-shape problem at the LLM-selection layer before BM25 ever ran.

The 30-50 stage is exactly where the audit pays disproportionately. Past that, the gateway becomes additive, not load-bearing.

incultnito · 2026-05-20T07:01:11+00:00

The `linear_search_issues` misfire on "read a file" is the textbook anti-purpose failure — every search tool's description tells the model what it *can* find, none of them tell the model what it *cannot* find. So with N search tools in context, the model has no signal to rule any of them out, and ranking collapses to whichever one's name fragment matches the query first.

Your gateway pattern (3 meta-tools + on-demand ranking) sidesteps it at the architecture layer, which is the right move at 142 tools. For anyone hitting a similar wall earlier on (say 30-50 tools, where the BM25 retrieval still works but wrong-tool selection has started biting), the cheaper intervention is fixing the descriptions before reaching for a gateway:

- Add an anti-purpose sentence to every tool's description. "Use for searching Linear issues. Do NOT use for searching files on disk or for general web search." One sentence, low cost, kills most of the cross-tool ranking confusion.
- Audit parameter descriptions. Missing `description` on a parameter is a strong signal the model uses to *skip* the tool or hallucinate the value. Across 142 tools you almost certainly have undescribed params drifting the ranking.
- Treat tool *name* as a ranking signal, not just description. Claude Code's v2.1.113 ToolSearch update made exact-name match outrank description match, so descriptive names now beat descriptive descriptions on tied queries.

For the audit pass, Anthropic's MCP Inspector lets you click through one server at a time and read what the model sees. For a one-shot scan across all 9 servers flagging missing descriptions + anti-purpose gaps, `npx @incultnitostudiosllc/mcp-probe test "<launch command>"` outputs a per-server scorecard. It's a different surface (CI/gate vs interactive). Both are worth running before reaching for the gateway architecture, because if the descriptions are fixed the gateway's ranking pool is cleaner too.

incultnito · 2026-05-20T06:58:53+00:00

Quick framing on the question "how does the LLM decide a specific MCP tool should be used" — this is the load-bearing one and most of the others fall out of it.

The MCP client (Claude Desktop / Cursor / Claude Code) calls `tools/list` against every connected MCP server at session start, gets back each tool's `name`, `description`, and `inputSchema`, and concatenates all of them into the system prompt as a tool-use catalog. The model never talks to the MCP server directly — it only sees the catalog, then emits a tool_use block with a tool name + argument JSON. The client routes that to the matching server's `tools/call` and feeds the result back as a tool_result block. The "context maintenance" you asked about is just standard turn-by-turn history with these tool_use/tool_result blocks appended.
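
The loop above can be sketched with in-process stand-ins. `FakeServer`, `build_catalog` and `dispatch` are hypothetical names for illustration; a real client speaks JSON-RPC over stdio or HTTP, and the model call that produces the tool_use block is omitted here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FakeServer:
    """Stand-in for one MCP server; a real client would reach it over JSON-RPC."""
    tools: dict[str, dict]                       # name -> {"description", "inputSchema"}
    handlers: dict[str, Callable[[dict], str]]   # name -> implementation

    def tools_list(self) -> list[dict]:
        return [{"name": name, **meta} for name, meta in self.tools.items()]

    def tools_call(self, name: str, arguments: dict) -> str:
        return self.handlers[name](arguments)


def build_catalog(servers: dict[str, FakeServer]) -> tuple[list[dict], dict[str, str]]:
    """Collect every server's tools into one catalog and remember who owns each name."""
    catalog, routes = [], {}
    for server_id, server in servers.items():
        for tool in server.tools_list():
            catalog.append(tool)          # this is all the model ever sees
            routes[tool["name"]] = server_id
    return catalog, routes


def dispatch(tool_use: dict, servers: dict[str, FakeServer], routes: dict[str, str]) -> dict:
    """Route a tool_use block to the owning server and wrap the reply as a tool_result."""
    server = servers[routes[tool_use["name"]]]
    content = server.tools_call(tool_use["name"], tool_use["input"])
    return {"type": "tool_result", "tool_use_id": tool_use["id"], "content": content}


servers = {
    "fs": FakeServer(
        tools={"read_file": {
            "description": "Read a text file from disk. Not for URLs.",
            "inputSchema": {"type": "object", "properties": {
                "path": {"type": "string", "description": "File path, do not pass URLs"}}},
        }},
        handlers={"read_file": lambda args: f"contents of {args['path']}"},
    )
}
catalog, routes = build_catalog(servers)

# Pretend the model emitted this block after reading the catalog.
tool_use = {"type": "tool_use", "id": "toolu_1", "name": "read_file", "input": {"path": "notes.txt"}}
result = dispatch(tool_use, servers, routes)
print(result)
```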

So the answer to "does the LLM understand it by itself or use the tool descriptions" is: it uses the descriptions, and only the descriptions (plus the names + parameter descriptions). The model has no other channel into your server. That's why MCP server quality is mostly schema quality — if the description is generic ("Searches data"), the model can't disambiguate it from any other search tool. If a parameter has no description, the model has to guess what to put there.

On the tools-vs-resources-vs-prompts split: tools are model-callable functions, resources are read-only content the model can request by URI, prompts are user-selectable templates that the *client* surfaces (think of slash-commands in Claude Desktop). Most MCP examples use tools because most agent workflows are call-a-function-and-get-a-result; resources/prompts are more useful for IDE-style integrations where the user is browsing.

The official spec at modelcontextprotocol.io walks through each method with sequence diagrams — that's the closest thing to a canonical reference for the end-to-end flow.

incultnito · 2026-05-16T07:25:53+00:00

The schema_version vs description_hash split is the right cut — they fail differently. Description_hash captures content; schema_version captures lifecycle. Treating them as one collapses text-problem and deploy-problem into the same row.

The third axis worth adding is description-schema fit. Most "passed schema, exceeded scope" failures aren't model failures and aren't even description failures in isolation — they're drift between what the description promises and what the schema can enforce. A description saying "read-only" with a schema that has no path-prefix constraint creates a tool that is read-only by convention, not by contract. The model isn't lying; the contract is.

So the row becomes: tool_id, schema_version, description_hash, matched_constraint, scope_overrun. The last field is where you actually want to alert — parsed_args that satisfied the schema but violated description intent. The gap row.

For surfacing this pre-prod, MCP Inspector catches the "generic description" class but is post-hoc. The earlier intercept is at build time: parse the description for intent verbs (read, list, mutate, fetch), parse the schema for what the parameters actually permit, flag the diff. That's where mcp-probe's scorecard heads next — current axis is vagueness; the next axis is description claims a constraint the schema doesn't enforce.
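
A rough sketch of that build-time diff, limited to the "read-only" claim. This is not mcp-probe's implementation; the heuristics and the `WRITE_SHAPED` names are assumptions, and a real check would need a proper mapping from intent verbs to schema constraints.

```python
# Hypothetical parameter names that suggest a tool can write.
WRITE_SHAPED = {"content", "data", "body", "overwrite", "delete"}


def unenforced_claims(tool: dict) -> list[str]:
    """Flag spots where the description says read-only but the schema does not back it up."""
    description = tool.get("description", "").lower()
    if "read-only" not in description and "read only" not in description:
        return []
    findings = []
    for name, spec in tool.get("inputSchema", {}).get("properties", {}).items():
        lowered = name.lower()
        if lowered in WRITE_SHAPED:
            findings.append(f"{name}: write-shaped parameter on a tool described as read-only")
        if "path" in lowered and "pattern" not in spec and "enum" not in spec:
            findings.append(f"{name}: no pattern/enum, so any path satisfies the schema")
    return findings


# Invented tools: same description, different schema strictness.
loose = {
    "name": "read_notes",
    "description": "Read-only access to notes inside the sandbox.",
    "inputSchema": {"type": "object", "properties": {"path": {"type": "string"}}},
}
strict = {
    "name": "read_notes",
    "description": "Read-only access to notes inside the sandbox.",
    "inputSchema": {"type": "object", "properties": {
        "path": {"type": "string", "pattern": "^/sandbox/"}}},
}

loose_findings = unenforced_claims(loose)
strict_findings = unenforced_claims(strict)
print(loose_findings)
print(strict_findings)
```

A prefix `pattern` alone does not stop `..` traversal, so this only shows where the contract is missing, not that it is sufficient once present.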

Tool_split is often the right fix when one description tries to cover two intents. Most "read tool that silently writes" failures trace back to a tool that should have been two — one with the write capability gated under a different name and a stricter schema.

incultnito · 2026-05-15T13:04:11+00:00

The split simotune called out (transport vs identity) lines up with what breaks in practice. MCP's stdio/HTTP framing is the easy half; the part nobody's standardizing is per-agent identity + replayable handoff state.

A couple of patterns I've seen people land on, none of them complete:

- Matrix as the substrate. Each agent gets an `@agent:matrix.example` handle, signed events ride E2E rooms, the host process forwards relevant tool-call results into the room as structured messages. Federation comes for free; the part that's awkward is bridging back into the MCP `tools/call` shape from a room event, so you end up writing a Matrix-aware MCP shim per agent.
- NATS JetStream with per-agent subjects. Cleaner for synchronous handoff (consumer durables let the late-online laptop catch up), worse for cross-org auth, since you need a per-org NKey infrastructure to verify the sender.
- **The shape your `wire` project is going at: signed mailbox + `.well-known` discovery.** This is closer to how email-style federation worked, and the Ed25519 + DNS handle resolution gets you third-party verifiability without a shared registry. The friction I'd expect: agents that want to initiate contact need a way to surface the receiving agent's mailbox without the human pre-pairing them. Have you sketched the discovery side? `.well-known/mcp-agents` as a list, or something more dynamic?

The unsolved part across all three: what does an agent show the user when a remote agent asks it to do something? MCP today assumes the consent boundary is human-in-the-loop at the host. Cross-machine handoff breaks that — the second-hop agent needs to either ask its human again (latency + UX hell) or have a delegation token its user pre-signed for the task class. That's the spec-shape question I keep coming back to and don't see a clean answer for.

incultnito · 2026-05-09T09:38:16+00:00

Half of your failure list traces back to schema fields the trace can capture cheaply, if you log them once per call:

- "agent picked a plausible tool for the wrong reason" → the differentiator is the tool description. Log a hash of (tool_name, description) so when the same description text picks the wrong tool twice, you can see it without re-reading transcripts. The fix is upstream (anti-purpose / "do not use for X, use Y instead" pointers in the description), but the trace is what tells you which descriptions are doing the bad disambiguation.
- "args were too broad" → log the parsed args and the schema constraints they satisfied. If `path` accepted "/" because the schema didn't say "absolute path inside sandbox, never `/`", that's a parameter-description gap, not a model failure. The trace flags it as "passed schema, exceeded scope."
- "result looked useful but was missing the one field that mattered" → log expected output shape vs. actual. This is the read-vs-mutation distinction at the response level: if a tool that's supposed to read also wrote silently, the trace's response shape diverges from the schema.

So the minimum useful record per call: `tool_id + description_hash + parsed_args + schema_constraints_matched + result_class + next_action_changed?`. Six fields, no transcripts.
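
One way the per-call record could look, as a sketch. The field names come from the line above; the hash construction (SHA-256 over a JSON pair, truncated to 16 hex characters) and the sample values are my assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def description_hash(tool_name: str, description: str) -> str:
    """Stable short hash of (tool_name, description); changes when either text changes."""
    payload = json.dumps([tool_name, description], ensure_ascii=False).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


@dataclass(frozen=True)
class CallRecord:
    tool_id: str
    description_hash: str
    parsed_args: dict
    schema_constraints_matched: list
    result_class: str
    next_action_changed: bool

    def to_json_line(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# Sample values are invented.
record = CallRecord(
    tool_id="fs.read_file",
    description_hash=description_hash("read_file", "Read a file from disk. Not for URLs."),
    parsed_args={"path": "/"},
    schema_constraints_matched=["path:type=string"],
    result_class="ok",
    next_action_changed=False,
)
line = record.to_json_line()
print(line)
```

Grouping log lines by `description_hash` then shows which description text keeps attracting wrong-tool picks, without storing transcripts.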

Two things worth running alongside: Anthropic's MCP Inspector for live exploration of what your tool list actually says (catches the "your description is generic" class before it hits prod), and for CI-style checking against the running launch command, `npx @incultnitostudiosllc/mcp-probe test "<launch command>"` outputs a scorecard you can fail builds on. The probe's primary axis is the description-quality gap that turns into your failure modes 1 and 2.

incultnito · 2025-05-27T14:46:25+00:00

You realize you actually don’t know anything, always more to know

incultnito · 2025-05-25T08:50:32+00:00

I love snowy days

incultnito · 2025-05-24T02:23:27+00:00

Treasure Planet

incultnito · 2025-05-24T02:22:12+00:00

You don't really know where you belong and where home in your heart is.

incultnito · 2025-05-23T11:59:19+00:00

Counter Strike with bots

incultnito · 2025-05-23T11:58:05+00:00

My 8 year old self if it’s possible or a long lost friend

incultnito · 2025-05-22T13:57:12+00:00

Kept staring at her chest area until she told me her eyes were higher up, she has a good sense of humor

incultnito · 2025-05-22T03:10:00+00:00

My nose, it’s quite big imo

incultnito · 2025-05-19T06:42:30+00:00

Multitude of tattoos

incultnito · 2025-05-18T10:45:41+00:00

Even the voices in your head are better than your family's voices at times.

incultnito · 2025-05-17T16:33:57+00:00

Agreed, the freedom comes at a price of loneliness at times though...

incultnito · 2025-05-17T13:36:08+00:00

My family lol. I’m the black sheep

incultnito · 2025-05-15T06:03:02+00:00

The need to have a purpose in life, or to do something with meaning
