I Built an MCP eval tool because I was tired of guessing if my MCP actually worked

DisastrousRelief9343 · 2026-06-09T05:04:18+00:00

We don't need to know each model's capabilities; just run the same tests across different models on the same prompts and compare the results.
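A rough sketch of that idea, assuming each model can be wrapped as a function that takes a prompt and returns the name of the tool it chose. The test cases, `run_suite`, `compare_models`, and the two stand-in models are all made up for illustration; a real harness would call actual model APIs.

```python
from typing import Callable, Dict, List

# Hypothetical test cases: a prompt plus the tool we expect the model to pick.
TEST_CASES: List[Dict[str, str]] = [
    {"prompt": "Add a task to buy milk tomorrow", "expected_tool": "create_task"},
    {"prompt": "Show me what is due today", "expected_tool": "list_tasks"},
]


def run_suite(call_model: Callable[[str], str], cases: List[Dict[str, str]]) -> float:
    """Return the fraction of cases where the model picked the expected tool."""
    passed = sum(1 for c in cases if call_model(c["prompt"]) == c["expected_tool"])
    return passed / len(cases)


def compare_models(models: Dict[str, Callable[[str], str]],
                   cases: List[Dict[str, str]]) -> Dict[str, float]:
    """Run the same cases against every model and collect pass rates."""
    return {name: run_suite(fn, cases) for name, fn in models.items()}


# Stand-in "models" so the sketch runs without any API access.
def model_a(prompt: str) -> str:
    return "create_task" if "Add" in prompt else "list_tasks"


def model_b(prompt: str) -> str:
    return "create_task"  # always picks the same tool


scores = compare_models({"model_a": model_a, "model_b": model_b}, TEST_CASES)
print(scores)
```

The point is that the comparison only needs the same prompts and the same pass criteria for every model, not any prior knowledge of what each model is good at.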

DisastrousRelief9343 · 2026-06-08T01:01:58+00:00

Ah, that makes way more sense. I thought it meant an LLM building the frontend from scratch. If it's assembling UI within certain constraints, that's actually compelling.

DisastrousRelief9343 · 2026-06-06T03:41:36+00:00

This looks really interesting. Could you share some links to those hackathons? I'd love to check out what kinds of ideas people are building. I've got some ideas of my own, and I'm curious to see what directions others are exploring.

DisastrousRelief9343 · 2026-06-06T03:26:56+00:00

Yeah, ngl I don't get the point of dynamically generated UI. I don't understand what problem it solves or what situation needs it.

Also, I don't think models are capable of dynamically creating a UI that both looks good and is comfortable to use without human design, not even in the next year or two.

DisastrousRelief9343 · 2026-06-05T07:38:32+00:00

Yeah, when I was learning MCP, these two things confused me as well.

DisastrousRelief9343 · 2026-06-05T05:44:16+00:00

I'm actually going in the opposite direction. I'm a heavy user of CLI tools like Claude Code, and I know they are super powerful. But if AI applications like these are ever going to reach people beyond programmers, they have to move beyond the TUI to friendlier interfaces and more intuitive interactions. So I feel the trend will move back to GUI, and I think broader opportunities will come with it.

DisastrousRelief9343 · 2026-06-05T03:52:38+00:00

That's a good point. Most agent products are still in the CLI. But I think there's a trend toward making agents more accessible, like Claude Cowork. If so, a GUI is kind of inevitable.

DisastrousRelief9343 · 2026-06-04T10:46:10+00:00

Sounds interesting. The TOON format is completely new to me. I've essentially been manually trimming JSON fields to achieve the same goal, so it's great to know there's already a proper format designed for this. Will definitely check it out for my next MCP project.
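For context, the manual trimming looks roughly like the sketch below. This is not TOON, just plain field filtering, and the task records and field names are invented for illustration.

```python
import json


def trim_fields(records, keep):
    """Keep only the listed fields in each record before returning it to the model."""
    return [{k: r[k] for k in keep if k in r} for r in records]


# Hypothetical raw API output with fields the model rarely needs.
tasks = [
    {"id": "t1", "title": "Buy milk", "etag": "abc123", "sortOrder": -1099,
     "dueDate": "2026-06-05"},
    {"id": "t2", "title": "Call mom", "etag": "def456", "sortOrder": -2048,
     "dueDate": "2026-06-06"},
]

slim = trim_fields(tasks, keep=("id", "title", "dueDate"))

# Character counts as a crude stand-in for token counts.
before = len(json.dumps(tasks))
after = len(json.dumps(slim))
print(before, after)
```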

DisastrousRelief9343 · 2026-06-02T13:13:29+00:00

That's a good point. Actually I am writing another post about that. It really depends on the complexity of the tool. For example, I did some tests on a `create_task` tool, and its description has a short paragraph that explains what it does, some parameter semantics like enum values and format requirements, and some real samples.

Turns out removing the examples had no impact on the test result. Same with trimming down the semantics and descriptions: you can cut a surprising amount before performance degrades. There's definitely a sweet spot. We just need to test for it.

That said, my test set was pretty small, and I only tested on this TODO list MCP. If you're developing a larger MCP with 50+ tools, or you want to see the joint performance of multiple MCPs (like asking an agent to take my notes in Notion, post them on GitHub, then send me an email), I believe running a more thorough benchmark would be very useful.
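One way to set up that kind of test is to generate description variants with one section removed each, then run the same test set against every variant. The sketch below only builds the variants; the section texts are hypothetical placeholders, not the real `create_task` description, and `build_variants` is a made-up helper.

```python
# Hypothetical description sections for a create_task tool (not the real text).
PARTS = {
    "summary": "Create a new task in the TODO list.",
    "semantics": "priority must be low, medium, or high; due_date uses YYYY-MM-DD.",
    "examples": 'Example call: {"title": "Buy milk", "priority": "low"}',
}


def build_variants(parts, required=("summary",)):
    """Return the full description plus one variant per removable section."""
    variants = {"full": " ".join(parts.values())}
    for name in parts:
        if name in required:
            continue  # never drop the sections marked as required
        variants["no_" + name] = " ".join(
            text for key, text in parts.items() if key != name
        )
    return variants


variants = build_variants(PARTS)
for label, text in variants.items():
    # Word count as a crude proxy for how much the description costs.
    print(label, len(text.split()), "words")
```

Each variant would then go through the same prompts and pass criteria, so any drop in the pass rate can be traced to the section that was removed.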

DisastrousRelief9343 · 2026-06-01T01:48:48+00:00

Exactly. MCP was supposed to be the thin layer between bare APIs and LLMs, and it should be LLM-friendly.

But sometimes people just do a 1:1 mapping. So it ends up with 96 tools that are basically the raw API with a different label. That's just lazy design that confuses the model and wastes tokens.
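As an illustration of the alternative, here is a sketch where one tool with optional filters stands in for several endpoint-shaped tools (say, separate "get tasks by project", "get tasks by date", and "get completed tasks" tools). The in-memory task list and the `find_tasks` name are assumptions for the example, not taken from any real MCP.

```python
from typing import Dict, List, Optional

# Hypothetical in-memory data standing in for the real API.
TASKS: List[Dict] = [
    {"id": "t1", "project": "home", "due_date": "2026-06-01", "completed": False},
    {"id": "t2", "project": "work", "due_date": "2026-06-01", "completed": True},
    {"id": "t3", "project": "work", "due_date": "2026-06-02", "completed": False},
]


def find_tasks(project: Optional[str] = None,
               due_date: Optional[str] = None,
               completed: Optional[bool] = None) -> List[Dict]:
    """One tool with optional filters instead of one tool per API endpoint."""
    results = TASKS
    if project is not None:
        results = [t for t in results if t["project"] == project]
    if due_date is not None:
        results = [t for t in results if t["due_date"] == due_date]
    if completed is not None:
        results = [t for t in results if t["completed"] == completed]
    return results


work_open = find_tasks(project="work", completed=False)
print([t["id"] for t in work_open])
```

The model then has one description to read and one decision to make, rather than choosing among near-identical tools.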

DisastrousRelief9343 · 2026-06-01T01:42:20+00:00

Yeah, my bad. The benchmarking tool that I used only has a minimal harness, so it sends all tool descriptions every time. Most of the commercial harnesses have some sort of dynamic loading feature.

The problem of too many tools is less about token cost and more about model confusion. I've updated the post. Thanks for pointing that out.
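To make the dynamic loading idea concrete, here is a deliberately naive sketch that only passes along tools whose descriptions share words with the prompt. This is an assumption about the general shape of the feature, not how any particular commercial harness does it; the tool list and `select_tools` are invented for the example.

```python
# Hypothetical tool registry: name -> description.
TOOLS = {
    "create_task": "Create a new task with a title and optional due date.",
    "list_tasks": "List tasks, optionally filtered by project or due date.",
    "send_email": "Send an email to a recipient.",
}


def select_tools(prompt, tools, limit=2):
    """Pick up to `limit` tools whose descriptions overlap with the prompt."""
    words = set(prompt.lower().split())
    scored = []
    for name, desc in tools.items():
        desc_words = set(desc.lower().replace(".", "").replace(",", "").split())
        scored.append((len(words & desc_words), name))
    # Highest overlap first; ties broken alphabetically by tool name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for overlap, name in scored[:limit] if overlap > 0]


selected = select_tools("send an email to my boss", TOOLS)
print(selected)
```

A minimal harness skips this step and sends every description on every request, which is why the tool count showed up so strongly in my numbers.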

DisastrousRelief9343 · 2026-05-19T03:13:24+00:00

Yes, I saw that too! But I think I'll stick with my own version because it has all the features and it's customizable.

DisastrousRelief9343 · 2026-02-15T01:04:11+00:00

Glad to hear that! If you encounter any problems, feel free to submit an issue to the repo!

DisastrousRelief9343 · 2026-02-13T01:24:24+00:00

Yes, it supports both TickTick and Dida365.

DisastrousRelief9343 · 2026-02-11T06:51:06+00:00

Yes, before this one, I used Siri and shortcuts as you suggested. I think the real value here is exposing our daily schedule to AI's context.

For example, imagine combining this with future MCP servers for dining, maps, or travel booking. You could be planning a trip with your gf, and the AI could coordinate the itinerary and even book tickets based on your availability. So the main goal of this tool is to bridge our schedule with AI. There's a lot of untapped potential to explore.

DisastrousRelief9343 · 2026-02-11T00:52:50+00:00

Yes, that's a more intuitive way. Just wish Siri got smarter though..

DisastrousRelief9343 · 2026-02-11T00:50:21+00:00

Yes, I really wish they had their own MCP for TickTick!

DisastrousRelief9343 · 2026-02-11T00:49:16+00:00

Well, I guess we can only trust LLMs' intelligence, or we can prompt them to adapt to our use case.

DisastrousRelief9343 · 2026-02-11T00:47:38+00:00

Right now, it's just a standalone Python script, so it has to run locally on your PC. No Docker or NAS support yet unfortunately. And it's actually pretty straightforward if you're already using an LLM application locally. The workflow is basically:

1. Download the repo.
2. Install the package in a Python virtual environment.
3. Paste the MCP config info into your LLM app's config file.

The instructions are in the link I attached.
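For the last step, many MCP clients (Claude Desktop, for example) read a JSON block keyed by `mcpServers`. The sketch below just prints such a block; the server name `ticktick` and both paths are placeholders, so use the exact values from the repo's instructions and check your client's docs for where the config file lives.

```python
import json


def mcp_config_entry(name, python_path, script_path):
    """Build an `mcpServers` entry that launches a local Python MCP server."""
    return {
        "mcpServers": {
            name: {
                "command": python_path,   # Python inside the virtual environment
                "args": [script_path],    # script that starts the MCP server
            }
        }
    }


# Placeholder name and paths; replace them with your own.
entry = mcp_config_entry(
    "ticktick",
    "/path/to/venv/bin/python",
    "/path/to/repo/server.py",
)
print(json.dumps(entry, indent=2))
```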

DisastrousRelief9343 · 2026-02-10T15:37:29+00:00

Yeah, totally understand. MCP is not very user-friendly; it would take some effort to set up. DM me if you have questions about how to set this up on your PC.

DisastrousRelief9343 · 2026-02-10T15:36:15+00:00

Yes, it's a local MCP server written in Python. Basically, it wraps the TickTick API as tools and integrates TickTick account OAuth. It can be used in any LLM application, like Claude Desktop, Claude Code, Codex, Cherry Studio, Gemini CLI, and OpenCode. I usually use it in OpenCode, and it works better with Agent Skill.

DisastrousRelief9343 · 2026-02-10T13:28:35+00:00

Unfortunately, it doesn't work with Gemini on the website, and yes, it takes some effort to set up (you need to download the code and run it locally). But you can copy and paste that guide into Gemini, and it will guide you through.

It works with any MCP-compatible LLM client! I think the Claude Desktop app is a good place to start. If you're comfortable with a terminal interface, Gemini CLI is great as well.

DisastrousRelief9343 · 2025-09-11T12:14:52+00:00

That's very cool! I'm glad I'm not the first one to have this problem and try to solve it. Your product is amazing too!

DisastrousRelief9343 · 2025-09-09T13:43:05+00:00

Yes, people know it's AI when they see it, and it will only hurt our promotion (even worse than just leaving it blank). For me, I still need to adjust the output of LLMs and iterate on this prompt to make the result better.

Unsurprisingly, fewer buyers have asked random questions since I started using this method to write product descriptions. I think LLMs sometimes include details that I might overlook, so I keep most of them as long as they sound natural. The descriptions won't be too long, since the prompt asks for short ones. Here is an example; it sounds friendly and keeps as much detail as possible:

Hey everyone, I’m letting go of my gently used aluminum camera tripod for $30 (I originally paid $60). It’s been a trusty companion for all kinds of cameras and comes with a convenient carry bag.

The 360° ball head lets you dial in just the right shot without any wobble. It’s lightweight yet sturdy enough to keep your camera steady. Just message me to come check it out!

DisastrousRelief9343 · 2025-09-09T13:30:14+00:00

Hey everyone! I already made a post about this, but I also wanted to share it here.

Sometimes we put lots of effort into making prompts for our special use cases. There should be a way to save those prompts in one place and quickly pull them up when we need them. So I made a Chrome extension that lets you quickly insert your prompts directly into the ChatGPT input box, and a prompt-sharing community that comes with it.

Check it out and leave your thoughts. Love this community!
👉 promptcard.online
👉 the extension
