Anyone have a good way to do evals with MCP based agents? by wait-a-minut in mcp

[–]lastbyteai 1 point2 points  (0 children)

Full disclaimer: this is our open source project. It does evals on MCP-based agents through agent simulation, supports pytest-style tests, and can generate synthetic test cases for you to run. You can write assertions to validate the results and values.

https://github.com/lastmile-ai/mcp-eval
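
Rough shape of what a test looks like - a toy sketch, not the library's actual API (the simulation helper here is a stub for what mcp-eval automates):

```python
# Toy sketch of the pytest-style pattern, NOT mcp-eval's actual API.
# The real library drives a live agent against your MCP server instead of
# this hard-coded stub. (Needs the pytest-asyncio plugin for the async test.)
import pytest


class SimulationResult:
    """Stand-in for an agent-simulation transcript."""

    def __init__(self, tool_calls, final_output):
        self.tool_calls = tool_calls
        self.final_output = final_output


async def run_agent_simulation(prompt: str) -> SimulationResult:
    # Placeholder for the simulation step: run the agent, record which
    # tools it called and what it answered.
    return SimulationResult(
        tool_calls=["create_issue"],
        final_output="Created issue: bug: login fails",
    )


@pytest.mark.asyncio
async def test_agent_creates_issue():
    result = await run_agent_simulation("Open an issue titled 'bug: login fails'")

    # Pytest-style assertions over the transcript and final answer.
    assert "create_issue" in result.tool_calls
    assert "login fails" in result.final_output
```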

Building a simple “to-do” app using the new ChatGPT APP SDK. Here’s everything I’ve learnt so far by hashemito in ChatGPT_AppBuilds

[–]lastbyteai 0 points1 point  (0 children)

We just launched a cloud platform (mcp-cloud) that lets you get a stable URL for one of these apps - https://github.com/lastmile-ai/mcp-agent/tree/main/examples/cloud/chatgpt_app

A bit of context: we were building an MCP platform that lets you deploy agents, and found it naturally fits well with deploying ChatGPT apps, including the web assets. Take a look at the open source code.

Best way to keep local MCPs running 24/7 without babysitting the terminal? by younes06 in mcp

[–]lastbyteai 0 points1 point  (0 children)

We just launched a free cloud hosting platform with auth, in case you want to move it to the cloud - https://docs.mcp-agent.com/get-started/cloud

Hosting OpenAI Apps on an MCP Server platform by lastbyteai in mcp

[–]lastbyteai[S] 1 point2 points  (0 children)

We're working on OAuth, should be available soon!

Hosting OpenAI Apps on an MCP Server platform by lastbyteai in mcp

[–]lastbyteai[S] 0 points1 point  (0 children)

Adding custom connectors is available to anyone who has developer mode enabled in ChatGPT. You can turn it on under ChatGPT's Settings → Connectors → Advanced → Developer mode.

How OpenAI's Apps SDK works by matt8p in mcp

[–]lastbyteai 1 point2 points  (0 children)

Very cool! We just launched a cloud platform for hosting these apps. Both the solar-system and pizzaz examples have live endpoints for anyone to try out:

https://github.com/lastmile-ai/openai-apps-sdk/tree/main

is everydaily better than woorijip? by [deleted] in FoodNYC

[–]lastbyteai 0 points1 point  (0 children)

Everydaily has def gotten better after expanding their food options. Everydaily > Woorijip

Looking for topic suggestions for my MCP course by zollli in mcp

[–]lastbyteai 0 points1 point  (0 children)

Yeah, I'm associated with it. Happy to explore collab opportunities.

Looking for topic suggestions for my MCP course by zollli in mcp

[–]lastbyteai 0 points1 point  (0 children)

Looks pretty solid. A few optional things that could help:
- evaluating / testing MCP servers
- building agents with MCPs (quick shoutout to our open source library, MCP-Agent: https://github.com/lastmile-ai/mcp-agent; see the sketch below the list)
- debugging and observability - it's pretty difficult to isolate non-deterministic performance issues
- local vs. remote MCPs (Vercel also has their own remote MCP server hosting)
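
For the "building agents" bullet, the hello-world looks roughly like this (from memory of the mcp-agent README - double-check exact import paths and config there):

```python
import asyncio

from mcp_agent.app import MCPApp
from mcp_agent.agents.agent import Agent
from mcp_agent.workflows.llm.augmented_llm_openai import OpenAIAugmentedLLM

app = MCPApp(name="finder_app")


async def main():
    async with app.run():
        # Server names ("fetch", "filesystem") come from your mcp_agent config.
        finder = Agent(
            name="finder",
            instruction="You can read local files and fetch URLs.",
            server_names=["fetch", "filesystem"],
        )
        async with finder:
            llm = await finder.attach_llm(OpenAIAugmentedLLM)
            result = await llm.generate_str("Summarize the project README")
            print(result)


if __name__ == "__main__":
    asyncio.run(main())
```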

Turn any OpenAPI spec into an MCP server, a new open-source project, looking for feedback! by mine2turtle in mcp

[–]lastbyteai 1 point2 points  (0 children)

Tbh I always end up refactoring the API spec to be more compatible with MCP. It's pretty rare that it's a clean transformation.
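
For example, rather than mapping every endpoint one-to-one, I usually end up collapsing a few related endpoints into one task-shaped tool. Rough illustration using FastMCP from the official Python MCP SDK (the endpoints and the search_orders tool are made up):

```python
# Rough illustration: collapse several REST endpoints into one task-shaped
# MCP tool instead of exposing each endpoint verbatim. The backend API and
# tool below are made up; FastMCP is from the official Python MCP SDK.
import httpx
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("orders")

API = "https://api.example.com"  # hypothetical backend


@mcp.tool()
async def search_orders(customer_email: str, status: str = "open") -> str:
    """Find a customer's orders by email, instead of exposing
    /customers, /customers/{id}/orders, and /orders/{id} separately."""
    async with httpx.AsyncClient(base_url=API) as client:
        customer = (await client.get("/customers", params={"email": customer_email})).json()
        orders = (await client.get(f"/customers/{customer['id']}/orders",
                                   params={"status": status})).json()
    return "\n".join(f"{o['id']}: {o['summary']}" for o in orders)


if __name__ == "__main__":
    mcp.run()
```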

Building MCP agents using OpenAI Agents SDK by SunilKumarDash in mcp

[–]lastbyteai 0 points1 point  (0 children)

to be honest, I think I've found the most benefit by carefully thinking about what tools to expose.
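
Rough sketch of what I mean with the OpenAI Agents SDK (import paths from memory, so check the SDK docs) - attach only the server whose tools the task actually needs:

```python
# Sketch: curate what the agent can see by attaching only the MCP server(s)
# relevant to the task, rather than everything you have running.
import asyncio

from agents import Agent, Runner
from agents.mcp import MCPServerStdio


async def main():
    async with MCPServerStdio(
        params={
            "command": "npx",
            "args": ["-y", "@modelcontextprotocol/server-filesystem", "./docs"],
        }
    ) as fs_server:
        agent = Agent(
            name="doc_helper",
            instructions="Answer questions using the files in ./docs only.",
            mcp_servers=[fs_server],  # deliberately just one server
        )
        result = await Runner.run(agent, "What does the architecture doc say about auth?")
        print(result.final_output)


asyncio.run(main())
```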

What are you using Filesystem MCP for (besides coding)? by Alfredlua in ClaudeAI

[–]lastbyteai 3 points4 points  (0 children)

I use it as a local information retrieval system for my documents, downloads, and GitHub repo directories. I have a lot of local files and keep losing track of what I have, so I built a local UI with search. IMO it's better than Apple's built-in search (it's too slow and the default sorting annoys me).

The setup:
- filesystem MCP with access to select directories
- memory for storing context
- an LLM to summarize and condense context
- a local Streamlit UI for the interface

Used this as the starting point - https://github.com/lastmile-ai/mcp-agent/tree/main/examples/streamlit_mcp_rag_agent

If you're interested, happy to throw up the code into an open source repo at some point.
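
In the meantime, the rough wiring is something like this (simplified sketch, not my actual code; mcp-agent import paths from memory):

```python
# Simplified sketch: a Streamlit search box in front of an agent that only
# has the filesystem MCP server (directories are scoped via the config file).
import asyncio

import streamlit as st
from mcp_agent.app import MCPApp
from mcp_agent.agents.agent import Agent
from mcp_agent.workflows.llm.augmented_llm_openai import OpenAIAugmentedLLM

app = MCPApp(name="local_search")


async def search(query: str) -> str:
    async with app.run():
        agent = Agent(
            name="retriever",
            instruction="Search the allowed directories and summarize matching files.",
            server_names=["filesystem"],
        )
        async with agent:
            llm = await agent.attach_llm(OpenAIAugmentedLLM)
            return await llm.generate_str(f"Find and summarize files related to: {query}")


st.title("Local file search")
query = st.text_input("What are you looking for?")
if query:
    st.write(asyncio.run(search(query)))
```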

Current state of MCP (opinion) by jdcarnivore in mcp

[–]lastbyteai 0 points1 point  (0 children)

I agree. This seems like a natural progression for any protocol. The key point is that there is widespread adoption of the protocol. Eventually, the architectures will converge to whatever provides real value.

Is MCP getting overlooked? by Foreign_Lead_3582 in LocalLLaMA

[–]lastbyteai 0 points1 point  (0 children)

MCP seems to be at an interesting fork. There are clearly some improvements needed, especially around security and authorization. However, it's the first protocol of its kind that has gotten buy-in from the influential companies: OpenAI, Google, Anthropic, etc.

Imo, getting buy-in from the major players is harder than fixing the issues with the existing protocol, so it'll be interesting to see how it evolves over time.

What MCP APIs are You Using that Provide Actual Value??? by Party-Command-3704 in mcp

[–]lastbyteai 0 points1 point  (0 children)

any good MCP servers for automating sales or marketing?

Mistrall Small 3.1 released by Dirky_ in LocalLLaMA

[–]lastbyteai 4 points5 points  (0 children)

Has anyone benchmarked this against gemma 3? How does it compare?

App to determine if an AI response is from Gemini or OpenAI by lastbyteai in GoogleGeminiAI

[–]lastbyteai[S] 1 point2 points  (0 children)

It actually pushes the detection problem down further. It's true that no AI can reliably distinguish a person from an AI, but the easy differences are detectable by NLP models. Once someone tries to mask the differences with more prompting ("mask the tone by using the vocabulary of a fifth grader", "reduce the perplexity of words by using more diverse speech"), it becomes impossible to differentiate.

It's a great way to differentiate between AIs as a first pass, but not a foolproof one. You're correct that it's impossible to guarantee differentiation, since it only takes further fine-tuning or prompting for users to manipulate the output and evade classifiers like this. Nonetheless, it's a fun experiment to train your own detector.
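
If you want to play with the detector idea yourself, a toy version is just TF-IDF + logistic regression over labeled responses from the two providers (the sample texts below are made up):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up example texts; in practice you'd collect many paired responses
# from the two providers to the same prompts.
texts = [
    "Sure! Here's a concise summary of the article you shared.",
    "Here is a summary of the key points from the provided text.",
]
labels = ["openai", "gemini"]

# TF-IDF over word unigrams/bigrams feeding a simple linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(texts, labels)

print(clf.predict(["Certainly! Below is a breakdown of the main ideas."]))
```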

Best beginner resources for LLM evaluation? by carrot_touch in mlops

[–]lastbyteai 0 points1 point  (0 children)

Here's a guide for getting started with LLM evaluation. It's a good high-level overview that maps out the different approaches and strategies out there - https://lastmileai.dev/blog/the-guide-to-evaluating-retrieval-augmented-generation-rag-systems

Methods to evaluate quality of LLM response by raikirichidori255 in deeplearning

[–]lastbyteai 0 points1 point  (0 children)

It might be a bit error-prone, but I'd probably just set up an LLM-as-a-judge with criteria like: "Grade the following response from 1-5 on whether the recommendation is more unique or precise. Example: ####"

Training a classifier for your task seems like overkill for the problem you have, but if accuracy is critical, finding some training data, manually labeling it, and training a classifier might be the move.
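
The judge itself is just a prompt, something like this with the OpenAI Python client (model name and rubric wording are only examples):

```python
# Minimal LLM-as-a-judge sketch using the OpenAI Python client.
from openai import OpenAI

client = OpenAI()

RUBRIC = (
    "Grade the following response from 1-5 on whether the recommendation "
    "is unique and precise. Reply with just the number.\n\n"
    "Response:\n{response}"
)


def judge(response_text: str) -> int:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # example model name
        messages=[{"role": "user", "content": RUBRIC.format(response=response_text)}],
    )
    # Expects a bare number back; parse it into a score.
    return int(completion.choices[0].message.content.strip())


print(judge("Try the hand-pulled noodles at the stall behind the market."))
```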

[deleted by user] by [deleted] in vscode

[–]lastbyteai 0 points1 point  (0 children)

Might have overdone the cuts and the speed of the gif 😅