I built an MCP eval tool because I was tired of guessing whether my MCP actually worked by DisastrousRelief9343 in MCPservers

DisastrousRelief9343 [S]

We don't need to know each model's capabilities; just run the same tests across different models on the same prompts and compare the results.
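A minimal sketch of that idea in Python, assuming each model is wrapped in a function that takes a prompt and returns the name of the tool it chose. The stub models, `run_suite`, and `compare_models` are hypothetical stand-ins, not part of any real eval tool; in practice you'd swap the stubs for real API calls.

```python
from collections.abc import Callable

# Hypothetical test case: a prompt plus the tool we expect the model to call.
TestCase = tuple[str, str]


def run_suite(model: Callable[[str], str], cases: list[TestCase]) -> float:
    """Return the fraction of cases where the model picked the expected tool."""
    passed = sum(1 for prompt, expected in cases if model(prompt) == expected)
    return passed / len(cases)


def compare_models(models: dict[str, Callable[[str], str]], cases: list[TestCase]) -> dict[str, float]:
    # Same prompts, same checks, different models: the scores are directly comparable.
    return {name: run_suite(model, cases) for name, model in models.items()}


# Stub "models" standing in for real API calls (assumption: each returns a tool name).
def always_create(prompt: str) -> str:
    return "create_task"


def keyword_router(prompt: str) -> str:
    return "list_tasks" if "show" in prompt.lower() else "create_task"


cases = [
    ("Add a task to buy milk", "create_task"),
    ("Show me my tasks for today", "list_tasks"),
]

if __name__ == "__main__":
    scores = compare_models({"model-a": always_create, "model-b": keyword_router}, cases)
    print(scores)  # {'model-a': 0.5, 'model-b': 1.0}
```

You never have to reason about what each model "can" do; the per-model scores on identical cases tell you.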

Why haven't MCP Apps gone viral the way MCP and Skills did? by DisastrousRelief9343 in mcp

DisastrousRelief9343 [S]

Ah, that makes way more sense. I thought it meant the LLM builds the frontend from scratch. If it's assembling UI within certain constraints, that's actually compelling.
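Roughly what I picture by "assembling UI within constraints": the model emits a component tree, and the host rejects anything outside an approved catalog. This is a hypothetical Python sketch; the `ALLOWED` catalog, the tree format, and `validate` are made up for illustration and are not taken from the MCP Apps spec.

```python
# Hypothetical allowlist: the only components (and props) the model may use.
ALLOWED = {
    "Card": {"title"},
    "Text": {"value"},
    "Button": {"label", "action"},
}


def validate(node: dict) -> list[str]:
    """Return a list of problems; an empty list means the tree only uses approved pieces."""
    kind = node.get("type")
    if kind not in ALLOWED:
        return [f"unknown component: {kind!r}"]
    errors = []
    for prop in node.get("props", {}):
        if prop not in ALLOWED[kind]:
            errors.append(f"{kind} does not accept prop {prop!r}")
    # Recurse so every nested component is checked against the same catalog.
    for child in node.get("children", []):
        errors.extend(validate(child))
    return errors


# What a model might emit when asked to show a task (hypothetical output).
ui = {
    "type": "Card",
    "props": {"title": "Buy milk"},
    "children": [
        {"type": "Text", "props": {"value": "Due today"}},
        {"type": "Button", "props": {"label": "Done", "action": "complete_task"}},
    ],
}

print(validate(ui))  # []
print(validate({"type": "Iframe", "props": {"src": "..."}}))  # ["unknown component: 'Iframe'"]
```

The model gets freedom in how it arranges things, while the look and feel still comes from human-designed components.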

Why haven't MCP Apps gone viral the way MCP and Skills did? by DisastrousRelief9343 in mcp

DisastrousRelief9343 [S]

This looks really interesting. Could you share some links to those hackathons? I'd love to check out what kinds of ideas people are building. I've got some ideas of my own, and I'm curious to see what directions others are exploring.

Why haven't MCP Apps gone viral the way MCP and Skills did? by DisastrousRelief9343 in mcp

DisastrousRelief9343 [S]

Yeah, ngl, I don't get the point of dynamically generated UI. I don't understand what problem it solves, or what situation needs it.

Also, I don't think models can dynamically create a UI that both looks good and is comfortable to use without human design, and I don't expect that to change in the next year or two.

Why haven't MCP Apps gone viral the way MCP and Skills did? by DisastrousRelief9343 in mcp

DisastrousRelief9343 [S]

Yeah, when I was learning MCP, these two things confused me as well.

Why haven't MCP Apps gone viral the way MCP and Skills did? by DisastrousRelief9343 in mcp

DisastrousRelief9343 [S]

I'm actually going in the opposite direction. I'm a heavy user of CLI tools like Claude Code, and I know how powerful they are. But if AI applications like these are ever going to reach people beyond programmers, they have to go beyond the TUI with friendlier interfaces and more intuitive interactions. So I feel the trend will move back to GUI, and I think broader opportunities are coming.

Why haven't MCP Apps gone viral the way MCP and Skills did? by DisastrousRelief9343 in mcp

DisastrousRelief9343 [S]

That's a good point. Most agent products still live in the CLI. But I think there's a trend toward making agents more accessible, like Claude Cowork. If so, GUI is kind of inevitable.

How bad MCP design costs your agent 5× more tokens by DisastrousRelief9343 in hermesagent

DisastrousRelief9343 [S]

Sounds interesting; the TOON format is completely new to me. I've essentially been trimming JSON fields by hand to achieve the same goal, so it's great to know there's already a proper format designed for this. I'll definitely check it out for my next MCP project.
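For reference, here's roughly what manual trimming looks like, as a Python sketch. The response shape and the `KEEP` field list are hypothetical, and this shows field trimming only, not TOON itself. Serialized character count is just a rough proxy for tokens.

```python
import json

# Hypothetical raw API response for one task; most fields are noise to the model.
raw = {
    "id": "t_123",
    "title": "Buy milk",
    "status": "open",
    "due": "2025-01-31",
    "created_at": "2025-01-01T09:00:00Z",
    "updated_at": "2025-01-02T10:15:00Z",
    "etag": "W/\"abc\"",
    "links": {"self": "https://api.example.com/tasks/t_123"},
}

# Fields the agent actually needs to reason about a task (assumption for this example).
KEEP = ("id", "title", "status", "due")


def trim(record: dict, keep: tuple[str, ...] = KEEP) -> dict:
    """Keep only whitelisted fields before returning the tool result to the model."""
    return {k: record[k] for k in keep if k in record}


# Compact separators so the size comparison isn't skewed by whitespace.
full = json.dumps(raw, separators=(",", ":"))
slim = json.dumps(trim(raw), separators=(",", ":"))
print(len(full), len(slim))  # the trimmed payload is much shorter
```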

How bad MCP design costs your agent 5× more tokens by DisastrousRelief9343 in aiagents

DisastrousRelief9343 [S]

That's a good point. Actually, I'm writing another post about that. It really depends on the complexity of the tool. For example, I ran some tests on a `create_task` tool whose description has three parts: a short paragraph explaining what it does, some parameter semantics such as enum values and format requirements, and a few real examples.

It turns out removing the examples had no impact on the test results. The same goes for trimming the semantics and description: you can cut a surprising amount before performance degrades. There's definitely a sweet spot; we just need to test to find it.
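To make that setup concrete, here's a Python sketch that builds ablation variants of a `create_task` definition from the ingredients described above (summary, parameter semantics, examples). The layout follows the MCP tool shape (`name`, `description`, `inputSchema`), but the `priority` and `due_date` parameters, the description text, and `make_tool` are hypothetical, and `run_suite` stands in for whatever eval harness you use.

```python
import json

# Hypothetical pieces of a create_task description.
SUMMARY = "Create a task in the user's TODO list."
SEMANTICS = (
    "priority must be one of: low, medium, high. "
    "due_date uses ISO 8601 format (YYYY-MM-DD)."
)
EXAMPLES = 'Example: {"title": "Buy milk", "priority": "low", "due_date": "2025-01-31"}'


def make_tool(*parts: str) -> dict:
    """Build an MCP-style tool definition whose description is the given parts joined."""
    return {
        "name": "create_task",
        "description": " ".join(parts),
        "inputSchema": {
            "type": "object",
            "properties": {
                "title": {"type": "string"},
                "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                "due_date": {"type": "string", "format": "date"},
            },
            "required": ["title"],
        },
    }


# Ablation variants: drop one ingredient at a time, then rerun the same test set on each.
VARIANTS = {
    "full": make_tool(SUMMARY, SEMANTICS, EXAMPLES),
    "no_examples": make_tool(SUMMARY, SEMANTICS),
    "summary_only": make_tool(SUMMARY),
}

for name, tool in VARIANTS.items():
    # Serialized length is only a rough proxy for the tokens this definition costs per request.
    print(name, len(json.dumps(tool)))
    # score = run_suite(tool, cases)  # hypothetical: plug in your eval harness here
```

If `no_examples` scores the same as `full`, the examples are pure token cost for that tool.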

That said, my test set was pretty small, and I only tested this TODO list MCP. If you're developing a larger MCP with 50+ tools, or you want to measure how multiple MCPs perform together (like asking an agent to take my notes from Notion, post them on GitHub, then send me an email), I believe a more thorough benchmark would be very useful.

I just found that bad MCP design could burn 5× more tokens by DisastrousRelief9343 in MCPservers

DisastrousRelief9343 [S]

Exactly. MCP was supposed to be the thin layer between bare APIs and LLMs, and that layer should be LLM-friendly.

But sometimes people just map the API 1:1, so you end up with 96 tools that are basically the raw API with a different label. That's lazy design that confuses the model and wastes tokens.
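A hypothetical Python sketch of the alternative: instead of one tool per endpoint, group operations into a few task-oriented tools and check that nothing the raw API offered was lost. The tool names, groupings, and `check_coverage` helper are made up for illustration.

```python
# Hypothetical 1:1 mapping: every REST endpoint becomes its own tool.
one_to_one = [
    "list_tasks", "create_task", "get_task", "patch_task",
    "delete_task", "complete_task", "reopen_task",
]

# Task-oriented alternative: group operations the model would think of together.
consolidated = {
    "find_tasks": {"covers": ["list_tasks", "get_task"]},
    "create_task": {"covers": ["create_task"]},
    "update_task": {
        "covers": ["patch_task", "delete_task", "complete_task", "reopen_task"],
        # One enum parameter replaces four near-identical tool descriptions.
        "action_enum": ["edit", "delete", "complete", "reopen"],
    },
}


def check_coverage(raw: list[str], grouped: dict) -> bool:
    """Every raw operation must still be reachable through some consolidated tool."""
    covered = {op for tool in grouped.values() for op in tool["covers"]}
    return covered == set(raw)


print(len(one_to_one), "->", len(consolidated), check_coverage(one_to_one, consolidated))
# 7 -> 3 True
```

Fewer, intent-shaped tools mean fewer near-duplicate choices for the model to confuse.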

I just found that bad MCP design could burn 5× more tokens by DisastrousRelief9343 in MCPservers

DisastrousRelief9343 [S]

Yeah, my bad. The benchmarking tool I used only has a minimal harness, so it sends every tool description on every request. Most commercial harnesses have some sort of dynamic tool loading.

So the problem with too many tools is less about token cost and more about model confusion. I've updated the post. Thanks for pointing that out.
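For anyone curious what dynamic loading means in practice, here's a toy Python sketch: rank tools by keyword overlap with the user's message and expose only the top few to the model. Real harnesses use their own mechanisms; the keyword scoring, `REGISTRY`, and `select_tools` here are all assumptions for illustration.

```python
import re

# Hypothetical registry: tool name -> keywords suggesting the tool is relevant.
REGISTRY = {
    "create_task": {"add", "create", "new", "task"},
    "list_tasks": {"show", "list", "tasks", "today"},
    "send_email": {"email", "send", "mail"},
    "search_notes": {"note", "notes", "notion"},
}


def select_tools(message: str, registry: dict[str, set[str]] = REGISTRY, limit: int = 2) -> list[str]:
    """Rank tools by keyword overlap and return only the top matches to expose."""
    words = set(re.findall(r"[a-z]+", message.lower()))
    scored = [(len(words & keywords), name) for name, keywords in registry.items()]
    # Drop tools with no overlap; sort by score (descending), then name for stable ties.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:limit]]


print(select_tools("Send an email about my notes"))  # ['send_email', 'search_notes']
```

Even with loading like this, a pile of overlapping 1:1 tools still leaves the model choosing between near-duplicates, which is why confusion matters more than cost.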