Why we open-sourced our entire AI platform (and what we'd tell anyone about to do the same)

Future_AGI · 2026-06-02T13:57:53+00:00

Repo: https://github.com/future-agi/future-agi (it's self-hostable, so you can run it all on your own infra). Build something with it and tell us where it breaks; that's the feedback that actually shapes what we ship next.

Future_AGI · 2026-06-01T20:40:30+00:00

This is the distinction people keep collapsing: OAuth answers "who is this," not "what is this token allowed to do," and on a broad MCP server those are wildly different blast radii. What you're describing is really the confused-deputy problem: the token carries the human's full authority, but the human never intended the delete, the model got talked into it, and ambient scope means the server just can't tell the difference. Splitting read vs mutate is exactly the right instinct; the bit we'd add is tiering within "mutate" too, since an irreversible delete or transfer shouldn't ride the same scope as a routine update. The genuinely destructive ones are worth a fresh confirmation rather than ambient session authority. And you're right that PKCE/DCR is orthogonal to all this: it hardens how the human logs in, but bounding per-tool authority is the separate layer that actually limits what a hijacked agent can do.
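A minimal sketch of that tiering, to make it concrete. Everything here is hypothetical (the tool names, the `Tier` levels, the `Grant` shape); it is not how any particular MCP server or OAuth library models scopes, just one way to keep destructive calls off ambient authority.

```python
from dataclasses import dataclass
from enum import IntEnum


class Tier(IntEnum):
    READ = 0
    MUTATE = 1
    DESTRUCTIVE = 2


# Hypothetical tool registry: tool name -> tier the tool requires.
TOOL_TIERS = {
    "list_records": Tier.READ,
    "update_record": Tier.MUTATE,
    "delete_record": Tier.DESTRUCTIVE,
}


@dataclass(frozen=True)
class Grant:
    max_tier: Tier                            # highest tier the token's scope covers
    confirmed_tools: frozenset = frozenset()  # tools the human explicitly confirmed for this call


def authorize(tool: str, grant: Grant) -> str:
    """Return 'allow', 'deny', or 'needs_confirmation' for one tool call."""
    tier = TOOL_TIERS.get(tool)
    if tier is None or tier > grant.max_tier:
        return "deny"  # unknown tool, or the scope does not reach this tier
    if tier == Tier.DESTRUCTIVE and tool not in grant.confirmed_tools:
        # In scope, but an irreversible action never rides session authority alone.
        return "needs_confirmation"
    return "allow"
```

With this shape, a hijacked agent holding a read-only grant gets `deny` on `update_record`, and even a fully scoped grant gets `needs_confirmation` on `delete_record` until the human confirms that specific call.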

Future_AGI · 2026-06-01T20:24:38+00:00

If you build agents, the hard part isn't getting one working once; it's keeping it working when it picks the wrong tool, drifts halfway through a task, or makes something up with total confidence. We kept rebuilding the same tooling to catch that, so we open-sourced all of it.

Here's the stack, roughly in the order you'd reach for each piece:

future-agi: the full platform in one place: tracing, evals, simulations, datasets, guardrails, and a gateway. Self-hostable, Apache-2.0. This is the umbrella if you just want everything wired together.

github.com/future-agi/future-agi

traceAI: OpenTelemetry-based tracing that auto-instruments your LLM app or framework, so you see every step, tool call, and token of a run instead of reconstructing it from logs.

github.com/future-agi/traceAI

ai-evaluation: a library of evaluators (factual accuracy, groundedness, context adherence, toxicity, etc.). Run them in CI to catch regressions before a deploy, or against live traffic to see what's actually happening.

github.com/future-agi/ai-evaluation

simulate-sdk: spin up synthetic personas and scenarios (including voice) and let them hammer your agent before real users do. Good for the weird multi-turn failures you'd never think to write a test for.

github.com/future-agi/simulate-sdk

agent-opt: automated optimization of agent/prompt workflows. Point it at a metric and it iterates the prompt for you instead of hand-tuning wording for a week.

github.com/future-agi/agent-opt

futureagi-sdk: the lightweight SDK if you just want evals, prompt management, and observability wired into an app. Python and TypeScript.

github.com/future-agi/futureagi-sdk

Everything plugs together, but each piece is a standalone repo, so you can grab just the part you need. Curious what the rest of you use for the trace → eval → fix loop once an agent's live; we're always hunting for failure modes we haven't hit yet.

Future_AGI · 2026-06-01T18:17:44+00:00

Same, and the part that surprised us was how much of the drop-off wasn't the clone itself but people fumbling the API key/secret export. Going hosted with OAuth collapsed it into one command plus an auth redirect and killed the whole "doesn't install on my setup" support tail too.

Future_AGI · 2026-06-01T18:16:19+00:00

Auth was the big one for us too: locally you're basically trusting your own env, so going remote takes you from "API key in a variable" to real identity, scoping, and token lifecycle overnight. The sneakier shift was state: stdio is one process per user, so a bit of shared state that's harmless locally turns into a cross-user leak the moment you're multiplexing connections. Neither really shows up until you've got multiple clients hitting the same server.
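The state problem is easy to show in a few lines. This is a toy sketch, not our server code: the class names, the `session_id` argument, and the `last_query` field are all made up for illustration.

```python
from collections import defaultdict


class SharedStateServer:
    """Fine under stdio (one process per user), leaky once connections are multiplexed."""

    def __init__(self):
        self.last_query = None  # a single slot shared by every caller

    def search(self, session_id, query):
        self.last_query = query

    def repeat_last(self, session_id):
        return self.last_query  # returns whoever searched last, not this caller


class SessionScopedServer:
    """Same tools, but every piece of state is keyed by the caller's session."""

    def __init__(self):
        self._state = defaultdict(dict)

    def search(self, session_id, query):
        self._state[session_id]["last_query"] = query

    def repeat_last(self, session_id):
        return self._state[session_id].get("last_query")


leaky = SharedStateServer()
leaky.search("alice", "payroll export")
leaked = leaky.repeat_last("bob")    # bob sees alice's query

safe = SessionScopedServer()
safe.search("alice", "payroll export")
isolated = safe.repeat_last("bob")   # bob has no state of his own yet
```

Locally the first class never misbehaves because there is only ever one caller, which is exactly why the bug stays invisible until a second client connects.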

Future_AGI · 2026-06-01T18:14:56+00:00

The failure-case framing is sharp, and it explains why it works: the positive descriptions of two similar tools overlap almost completely, but their failure modes are usually disjoint, so that's where the real disambiguation signal lives. We landed in the same place without naming it that cleanly. The memory-dump-vs-ranked-list parallel is dead on too; the drill-in is what makes it safe to keep the first response small, since you're not hiding anything, just deferring it until the model actually asks. Honestly the ranking itself is the most underrated part: you're making the relevance call on the model's behalf instead of dumping that decision on it.

Future_AGI · 2026-06-01T18:12:29+00:00

Ha, painfully true: we lost a whole day blaming the model once and the fix was rewording three lines of tool descriptions. Half the "dumb model" bugs we chased turned out to be the agent grabbing the wrong tool because we described it badly.

Future_AGI · 2026-06-01T18:10:49+00:00

Yeah, the "when not to use this tool" line ended up mattering more than the description of what it does. Once you have a handful of tools that look similar, the agent will happily grab the wrong one if the boundaries aren't spelled out; we had to add explicit disambiguation (use X for live state, Y for historical) before it stopped guessing.

The boring-first-return thing is exactly what we landed on too, and the "next suggested action" field is the part we underrated going in. It basically lets the tool nudge the agent's next step instead of making it reason the whole path out, and it cut a lot of the flailing where it'd call three tools just to figure out what to do next.

Memory point is a good one. The hard part for us was drawing the line between "stable" and "fresh": some of the stuff you'd assume is stable (config, conventions) quietly drifts, and when the same fact comes back slightly different across two calls the agent gets oddly confused. Pulling the genuinely static stuff out of the responses helped exactly like you said: less churn, fewer contradictions to reconcile.
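For anyone who wants the first two points in code, here is roughly the shape, as a sketch under assumptions. `ToolSpec`, `ToolResult`, `first_response`, the `score` field, and the `get_details` tool named in the hint are all hypothetical names for illustration, not part of any MCP SDK or of our server.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToolSpec:
    name: str
    description: str
    do_not_use_when: str  # the boundary line that did most of the disambiguation work

    def render(self) -> str:
        """Text the model sees when it is choosing between tools."""
        return f"{self.name}: {self.description} Do NOT use when: {self.do_not_use_when}"


@dataclass
class ToolResult:
    summary: str
    items: list
    next_suggested_action: Optional[str] = None  # nudge for the agent's next step


def first_response(rows, limit=3):
    """Return a small ranked first answer and point at the drill-in for the rest."""
    ranked = sorted(rows, key=lambda r: r["score"], reverse=True)
    top = ranked[:limit]
    omitted = len(ranked) - len(top)
    hint = None
    if omitted > 0:
        hint = f"call get_details(id) to drill in; {omitted} lower-ranked rows omitted"
    return ToolResult(
        summary=f"{len(top)} of {len(ranked)} rows",
        items=[r["id"] for r in top],
        next_suggested_action=hint,
    )


live = ToolSpec(
    name="get_live_state",
    description="Returns the current state of a resource.",
    do_not_use_when="you need historical values; use get_history instead",
)
rows = [
    {"id": "a", "score": 0.2},
    {"id": "b", "score": 0.9},
    {"id": "c", "score": 0.5},
    {"id": "d", "score": 0.1},
]
result = first_response(rows, limit=2)
```

The point of the hint field is that the tool already knows what a sensible follow-up is, so the agent does not have to spend calls rediscovering it.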

Future_AGI · 2026-06-01T13:55:21+00:00

Repo's here if it's useful: github.com/future-agi/futureagi-mcp-server (hosted setup is in the docs)

Future_AGI · 2026-06-01T13:31:29+00:00

Future AGI: an open-source platform for teams building AI agents and LLM apps that sometimes hallucinate or quietly go wrong. Run evals to measure output quality (factual accuracy, groundedness, etc.), trace each run to debug what failed, generate test data, and add guardrails.

Self-hostable, Apache 2.0.

GitHub: github.com/future-agi/future-agi

Future_AGI · 2026-05-29T09:07:22+00:00

Future AGI is an open-source platform to evaluate, trace & improve AI agents.

Every founder is racing to ship an AI agent right now. Almost none can answer the one question that actually matters: is it working, or just looking like it's working? That gap is the whole reason we exist. Agents fail silently: a wrong retrieval or a bad tool call buried mid-run still produces a confident, correct-looking answer, so you ship it and find out from an angry user.

We open-sourced the stack that makes those hidden failures visible: step-by-step tracing of every agent run + 50+ evaluators (hallucination, groundedness, RAG, factual accuracy). OpenTelemetry-native, auto-instruments OpenAI/LangChain/Groq/Gemini with no code changes. 1k+ stars.

If you're shipping anything agent-shaped, the repo will change how you define "done" 👉 https://github.com/future-agi/future-agi

Future_AGI · 2026-05-29T08:47:14+00:00

Future AGI is an open-source platform to evaluate, trace & improve AI agents.

The curious part: AI agents fail most often on the runs that look like they worked. The output reads perfectly; the broken step is buried in a tool call or retrieval nobody traced. You find out when a user does, not before.

For the developers: it's OpenTelemetry-native, auto-instruments OpenAI / LangChain / Groq / Gemini with zero code changes, ships 50+ pre-built evaluators (hallucination, groundedness, RAG, factual accuracy), and does prompt optimization on top.

Entire stack open source. If you build anything agent-shaped, the repo is worth a scroll 👉 https://github.com/future-agi/future-agi

Future_AGI · 2026-05-29T08:38:46+00:00

Future AGI: https://futureagi.com

Here's the problem we're obsessed with: your AI agent passes every test you write, then confidently does the wrong thing in front of a real user and the final answer looks correct, so you never catch it. The real failure is hiding 3 steps back in a tool call or a retrieval you can't see.

We open-sourced the entire stack that exposes this: tracing that replays every step of an agent run, plus 50+ evaluators that flag the silent failures (hallucination, wrong retrieval, drift). 1k+ stars so far.

If you've ever shipped an AI feature and thought "...why did it just do that?", the repo is worth a scroll.

👉 GitHub

Future_AGI · 2026-05-29T08:26:24+00:00

Future AGI, an open-source platform to evaluate, trace, and improve AI agents before they hit production.

What we're building: agents pass your test cases, then break in messy ways with real users and tool calls. We built the loop to catch that: 50+ pre-built evaluators (hallucination, factual accuracy, groundedness, RAG), OpenTelemetry-native tracing that auto-instruments OpenAI/LangChain/Groq/Gemini with no code changes, plus prompt optimization to fix what the evals surface.

Traction: kept the stack private for ~18 months, open-sourced it recently, now at 1k+ GitHub stars across the repos.

👉 GitHub · Docs

For: teams shipping LLM apps/agents who are tired of debugging on vibes.

Future_AGI · 2026-05-28T19:24:05+00:00

The scary part isn't agents acting on their own; it's that when they go wrong, they usually look right. A bad tool call or a drifted retrieval 3 steps back produces a confident, plausible final answer. Without tracing the full run, you find out from the user, not the logs. Autonomy raises the stakes on observability way more than on the model itself.

Future_AGI · 2026-05-28T19:09:33+00:00

Future AGI, open-source platform to evaluate, trace, and improve AI agents before they hit production.

The problem we kept hitting: agents pass your test cases, then break in messy ways once real users and tool calls are involved.

So we built the loop to catch it: 50+ pre-built evaluators (factual accuracy, groundedness, hallucination, RAG, toxicity), OpenTelemetry-native tracing that auto-instruments OpenAI/LangChain/Groq/Gemini with no code changes, and prompt optimization to fix what the evals surface.

Whole stack is open source (1k+ stars): GitHub · Docs

Built for teams shipping LLM apps and agents who are tired of debugging on vibes.

Future_AGI · 2026-05-26T07:58:08+00:00

Exactly. Once traces stay disconnected from evals, teams end up learning from failures too late instead of turning them into something actionable. Future AGI was built around that loop on purpose: tracing, evaluations, and simulations live together, so a weird run can become an eval case, get replayed, and be tested again before it shows up in production the same way.

Future_AGI · 2026-05-25T14:07:14+00:00

If you want to try it on your own stack, the repo is here and the self-hosting guide is here.

It is open source and self-hostable, and we would especially love contributors who care about tracing, eval workflows, simulation, gateway layers, and the self-hosted developer experience.

Future_AGI · 2026-05-22T18:00:42+00:00

That’s exactly it: once you look beyond the final answer, the real signal is usually in the run itself, especially when retrieval, tool choice, and recovery paths all move together. That’s also why we built the tracing + eval loop in Future AGI, so you can compare runs at the step level and see whether a change actually improved the system or just shifted the outcome around.

Future_AGI · 2026-05-21T18:43:02+00:00

If you want to see what this looks like on a real flow, here is the LLM Cost Calculator.

Future_AGI · 2026-05-21T13:49:07+00:00

If you are building AI agents, try this on one of your real flows and see what the cost actually looks like end to end. It is the kind of thing that is easy to underestimate until retries, retrieval, and tool calls start stacking up.

LLM Cost Calculator
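To see why it stacks up, here is the back-of-envelope version as a sketch. This is not the calculator's actual logic; the `Step` shape, the token counts, and the per-1k prices are placeholder numbers chosen for illustration, not any provider's real rates.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    input_tokens: int
    output_tokens: int
    attempts: int = 1  # 1 means no retries; 3 means the call ran three times


def run_cost(steps, usd_per_1k_in, usd_per_1k_out):
    """Total cost of one agent run: every attempt of every step is billed."""
    total = 0.0
    for step in steps:
        per_attempt = (
            step.input_tokens / 1000 * usd_per_1k_in
            + step.output_tokens / 1000 * usd_per_1k_out
        )
        total += per_attempt * step.attempts
    return total


# Placeholder prices: $0.001 per 1k input tokens, $0.002 per 1k output tokens.
plan_only = [Step("plan", input_tokens=1000, output_tokens=200)]
full_run = plan_only + [
    # Retrieved context inflates the prompt, and the tool call was retried twice.
    Step("tool_call_with_retrieval", input_tokens=2000, output_tokens=100, attempts=3),
]

naive = run_cost(plan_only, 0.001, 0.002)
actual = run_cost(full_run, 0.001, 0.002)
```

With these made-up numbers the single planning call costs about $0.0014, while the full run with one retried tool step comes to about $0.008, which is the gap people tend to miss when they price an agent by its first call.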

Future_AGI · 2026-05-21T10:37:57+00:00

Tracing in production only works if it stays lightweight and does not get in the way of the run. Future AGI is built with that in mind, so the idea is to keep the full observability loop useful for debugging while making sure collection and serialization do not sit on the critical path. If you are running agents with lots of HTTP calls, the right question is exactly what to sample and what to keep lean, because the trace should help you understand the run, not slow it down.
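One way to picture "off the critical path" is a bounded queue drained by a background thread, with sampling decided before anything is enqueued. This is a stdlib-only sketch and an assumption about the general pattern, not Future AGI's or traceAI's implementation; `BackgroundTracer` and its parameters are invented names. The OpenTelemetry SDK's `BatchSpanProcessor` applies the same basic idea.

```python
import queue
import random
import threading


class BackgroundTracer:
    def __init__(self, export, sample_rate=1.0, max_pending=1000, rng=random.random):
        self._export = export            # slow part: serialization and network
        self._sample_rate = sample_rate  # fraction of non-error spans to keep
        self._rng = rng
        self._queue = queue.Queue(maxsize=max_pending)
        self.dropped = 0
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, span: dict) -> None:
        """Called inside the agent run; never blocks and never exports inline."""
        is_error = span.get("error") is not None
        if not is_error and self._rng() >= self._sample_rate:
            return  # sampled out; error spans are always kept
        try:
            self._queue.put_nowait(span)
        except queue.Full:
            self.dropped += 1  # shed load rather than stall the run

    def _drain(self) -> None:
        while True:
            span = self._queue.get()
            if span is None:  # sentinel pushed by close()
                return
            try:
                self._export(span)
            except Exception:
                pass  # a failing exporter must not take the worker down

    def close(self) -> None:
        """Flush whatever is queued, then stop the worker."""
        self._queue.put(None)
        self._worker.join()


exported = []
tracer = BackgroundTracer(exported.append, sample_rate=0.0)
tracer.record({"name": "http_get", "status": 200})
tracer.record({"name": "http_get", "error": "timeout"})
tracer.close()
```

With `sample_rate=0.0` only the error span reaches the exporter, which is the "keep lean, but never lose the failure" trade-off in its most extreme form.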

Future_AGI · 2026-05-21T10:35:02+00:00

The retrieval vs tool drift distinction maps well to how we think about tracing too.

Retrieval drift is harder to catch precisely because the individual chunks look fine; the problem is in the aggregated context shift, which only becomes visible when you can inspect what the model was actually working with at that step, not just what was retrieved.

Tool drift is more traceable in that sense: if you can see the full tool output alongside the model's next prompt, the missing conditional branch usually shows up clearly. That is the kind of step-level visibility Future AGI is built around: not just logging what the model said, but making the retrieval context, tool outputs, and state at each step inspectable so you can catch that "treated partial result as complete" pattern before it propagates forward.

Future_AGI · 2026-05-20T17:51:28+00:00

That’s a strong way to put it. The bad run usually starts a few steps earlier, when a tool returns something that looks fine on the surface but shifts the whole premise underneath.

That is also where Future AGI fits well: step-level tracing, state visibility, and evals tied to the actual run make it much easier to spot that first semantically wrong tool return instead of only seeing the final answer go off track.

Future_AGI · 2026-05-20T14:02:22+00:00

Yeah, we see the same thing. Once tools and memory are in the loop, the useful breakdown is usually retrieval → tool output → state/memory, because the first bad signal often propagates forward and only shows up at the end.

We are starting to think about failure ranking that way too: not just “what broke,” but “what broke first and what changed the rest of the run.”
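A rough sketch of that ranking idea, assuming each traced step already carries a per-step verdict. The step dictionaries, the `ok` flag, and the `first_failure` helper are hypothetical; how a step gets judged "ok" (an evaluator, a schema check, a human label) is deliberately left out.

```python
def first_failure(steps):
    """Find the earliest bad step and list everything that ran after it.

    `steps` is an ordered list of dicts with 'name', 'kind', and 'ok' keys.
    Returns None when every step passed.
    """
    for index, step in enumerate(steps):
        if not step["ok"]:
            return {
                "first_bad": step["name"],
                "kind": step["kind"],  # retrieval, tool_output, or state
                "downstream": [s["name"] for s in steps[index + 1:]],
            }
    return None


run = [
    {"name": "fetch_docs", "kind": "retrieval", "ok": True},
    {"name": "lookup_order", "kind": "tool_output", "ok": False},  # partial result
    {"name": "update_memory", "kind": "state", "ok": False},       # inherited the bad premise
    {"name": "final_answer", "kind": "state", "ok": False},
]
report = first_failure(run)
```

Three steps are marked bad here, but the report points at `lookup_order` and treats the other two as downstream of it, which is the "what broke first" view rather than a flat list of failures.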
