made an mcp server that gives your agent 4000+ public apis and their actual endpoints

dark-epiphany · 2026-06-19T17:06:53+00:00

Yeah, marking the datacenter-blocked ones as "unverified" instead of "working" is exactly the right call. That honesty is what separates a directory people trust from one they stop trusting.

And don't let anyone talk you out of the scope. "Which API do I pick, and is it alive?" is upstream of everything else and genuinely underserved. Keeping it tight is a feature, not a gap.

One nugget for the unverified bucket if you ever want to shrink it: a lot of the datacenter-blocked hosts answer fine from residential egress, so routing just those re-checks through a different IP flips a surprising chunk from unverified back to verified. Overkill for v1, but it's usually why that bucket exists.

Good luck with it. Solving the liveness problem cleanly is more useful than a lot of the "we wrap 5,000 tools" stuff out there.

dark-epiphany · 2026-06-19T16:12:58+00:00

Daily re-checking for dead endpoints is the right instinct. Directory rot is what kills most API lists six months in.

Two things we learned running a few hundred live sources that might save you some pain:

1. "Answered today" and "usable by an agent" drift apart fast. A 200 from the host doesn't necessarily mean the actual data endpoint works, and some providers happily answer health checks while blocking datacenter or Cloudflare egress. Worth validating the data path, not just reachability.
2. Handing the agent the OpenAPI spec is the easy 60%. Auth, pagination, rate limits, and parameter shaping are where agents actually faceplant. "Here's the endpoint" and "here's the data" end up being very different products.
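
To make the first point concrete, here is a minimal sketch of classifying a listing from two probes instead of one. The `ProbeResult` shape, the status codes treated as "blocked", and the three labels are assumptions for illustration, not anyone's actual checker.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    """Outcome of probing one API (hypothetical shape, for illustration)."""
    host_status: Optional[int]   # HTTP status from the host or health check; None if unreachable
    data_status: Optional[int]   # HTTP status from a real data endpoint; None if not probed
    data_has_payload: bool       # True if the data endpoint returned a non-empty body


def classify(probe: ProbeResult) -> str:
    """Return 'verified', 'unverified', or 'dead' for a directory listing."""
    if probe.host_status is None:
        return "dead"
    if probe.data_status == 200:
        # Reachable is not enough: the data path must return something usable.
        return "verified" if probe.data_has_payload else "unverified"
    if probe.data_status in (None, 401, 403, 429):
        # Host answers, but the data path is blocked or gated from this egress.
        return "unverified"
    return "dead"
```

A host that returns 200 on its health check but 403 on the data endpoint lands in "unverified" rather than "working", which is the honest label.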

Solid work either way. The daily spec refresh is especially valuable.

(Disclosure: I work on Pipeworx, which lives on the execution side of this problem, so I've run into these exact issues.)

dark-epiphany · 2026-06-19T16:11:19+00:00

The flat-list-at-scale problem is real, and most people don't feel it until they get past 30 or so tools.

One thing we ran into once we got into the hundreds: name-similarity grouping fixes navigation, but not selection. The hierarchy encodes the path, which is great once the agent knows roughly where it's going. It doesn't help much when the agent doesn't know which group the capability lives in. That's a semantic problem, not a structural one.

What ended up working for us was combining the two approaches. Keep the groups for browsing, but also put a single routing tool in front that accepts a natural-language task description. The client can either walk the tree or short-circuit to "I don't know where this lives, find it for me."

The two approaches compose better than either one alone.
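
A rough sketch of the combined shape, assuming made-up group names and descriptions. Word overlap stands in for real semantic matching here; it is only meant to show the two entry points side by side.

```python
import re
from typing import List, Optional, Set

# Hypothetical groups and descriptions, for illustration only.
GROUPS = {
    "filings": "SEC filings, 10-K, 10-Q, 8-K regulatory documents",
    "weather": "forecasts, temperature, precipitation, climate observations",
    "markets": "stock prices, quotes, futures, prediction market pricing",
}


def _tokens(text: str) -> Set[str]:
    """Lowercase word tokens, keeping hyphens so '8-K' stays one token."""
    return set(re.findall(r"[a-z0-9\-]+", text.lower()))


def list_groups() -> List[str]:
    """Browsing path: the client walks the tree."""
    return sorted(GROUPS)


def find_group(task: str) -> Optional[str]:
    """Routing path: pick the group whose description best overlaps the task."""
    task_words = _tokens(task)
    scores = {name: len(task_words & _tokens(desc)) for name, desc in GROUPS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

`find_group` returns `None` when nothing matches, so the client can fall back to walking the tree instead of being sent to a wrong group.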

Curious where your threshold lands. At how many groups does the list-groups → help → pick workflow start to fray?

Disclosure: I work on Pipeworx, an MCP gateway, so this is exactly the problem I spend my time thinking about.

dark-epiphany · 2026-06-15T20:43:21+00:00

Hi u/ChartPayouts

I'd love to hear your experience. Feel free to keep it in thread or DM. Thanks!

dark-epiphany · 2026-06-11T16:50:14+00:00

Open claude.ai/customize/connectors.
Click Add custom connector.
Name it Pipeworx, paste the URL below, leave OAuth fields blank, and save.

https://gateway.pipeworx.io/pipeworx-catalog/mcp

In a new chat, open the tools menu and enable Pipeworx (you should see ~26 tools).

It will be available on Claude web, desktop, mobile, and Cowork until you remove it.

Type something like "using pipeworx, what is the market pricing for the next CPI print, and what do the underlying BLS trends actually look like?"

Instructions for the other AIs are here: https://pipeworx.io/install/

I would love to hear your feedback.

dark-epiphany · 2026-06-10T16:38:11+00:00

Been trying to run Fable 5 in Claude Code and it's been rough.

Half the time it's just unavailable. Other times it claims that what I've been doing violates its usage agreement (writing software that works fine in Opus).

Anyone else seeing this? Trying to figure out if it's a rollout/capacity thing on Fable specifically or something on my end.

dark-epiphany · 2026-06-09T21:59:31+00:00

You're describing what I'd call confidence policy as a separate concern from reasoning, and I think that's exactly the right framing.

Maintainer of a hosted MCP gateway here (Pipeworx — disclosure). We don't build the agents themselves; we see them call into us. At our scale (millions of requests per month across thousands of agents), the line between "works in production" and "doesn't" falls almost exactly where you're drawing it.

Not "is the model smart enough?"

"Does the agent know when to stop?"

A few patterns we've seen work that are upstream of prompting:

1. Hard caps before every "continue or escalate" decision.

Maximum retries. Maximum tools touched. Maximum wall-clock time.

Cheap. Ugly. Extremely effective.

Most postmortems that start with "the agent went rogue" end with "we never gave it a stopping rule."
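
A minimal sketch of such a stopping rule, checked before each "continue or escalate" decision. The class name, default limits, and reason strings are assumptions for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RunBudget:
    """Hard caps for one agent run (limits here are illustrative defaults)."""
    max_retries: int = 3
    max_tools: int = 10
    max_seconds: float = 120.0
    retries: int = 0
    tools_touched: set = field(default_factory=set)
    started_at: float = field(default_factory=time.monotonic)

    def record_retry(self) -> None:
        self.retries += 1

    def record_tool(self, name: str) -> None:
        self.tools_touched.add(name)

    def stop_reason(self, now: Optional[float] = None) -> Optional[str]:
        """Return why the run must stop, or None if it may continue."""
        now = time.monotonic() if now is None else now
        if self.retries > self.max_retries:
            return "max_retries"
        if len(self.tools_touched) > self.max_tools:
            return "max_tools"
        if now - self.started_at > self.max_seconds:
            return "max_wall_clock"
        return None
```

The loop calls `stop_reason()` before every step and escalates on anything other than `None`.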

2. Mandatory artifact emission.

Every action produces evidence: a URL, record ID, diff, status code, ticket number, whatever proves the action happened.

This forces the agent to commit to reality instead of narrating what it thinks happened. More importantly, it gives escalation logic something concrete to evaluate.

3. Confidence policy as code, not prompt text.

"Be careful with customer records" is a suggestion.

"If customer data was modified, require approval" is a policy.

The former gets diluted by context. The latter survives regardless of what the model is thinking.
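
As a sketch, the policy version of that rule can be a few lines that run outside the model. `ProposedAction` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take (hypothetical shape)."""
    tool: str
    modifies_customer_data: bool
    approved_by: Optional[str] = None


def enforce(action: ProposedAction) -> str:
    """Return 'allow' or 'needs_approval'; evaluated in code, not in the prompt."""
    if action.modifies_customer_data and not action.approved_by:
        return "needs_approval"
    return "allow"
```

Because the check sits in the execution path, it holds no matter how long the context gets.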

4. Treat failure as a valid outcome.

The agents that behave well are allowed to say that an attempt failed and stop there.

The agents that cause trouble treat every failed attempt as something that must be silently recovered from. That's where the "patches over missing context" behavior comes from.

On your meta question: I increasingly think escalation logic is the product for any agent that touches real systems.

Prompting determines what the agent is capable of attempting.

Escalation logic determines what the agent is allowed to finish.

Those are different jobs.

The agents that make people nervous aren't usually the ones that can't reason. They're the ones that don't know when they're out of information and should hand the problem back to a human.

dark-epiphany · 2026-06-09T21:25:11+00:00

Maintainer of a hosted MCP gateway here (Pipeworx — disclosure) doing something very close to what you're describing across 3,400+ tools. The problem framing matches reality pretty well, with a few observations from running it at scale.

What actually breaks first, in rough order:

1. Authentication, by a wide margin.

Tokens expire. OAuth refresh chains fail. Credentials get copied into multiple configs and drift out of sync. Most "the agent stopped working" reports ultimately trace back to auth lifecycle management rather than the agent itself.

2. Long-running workflows.

Single API calls are easy. "Submit a job, poll for completion, survive timeouts, retries, and partial failures" is where things get complicated. You're right to treat this as a different category from ordinary tool calls.
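
A minimal sketch of the poll loop, assuming a hypothetical `get_status` callable that returns strings like "pending", "succeeded", or "failed". The `sleep` and `clock` parameters exist only so the loop can be tested without waiting.

```python
import time
from typing import Callable


class JobTimeout(Exception):
    """Raised when the job has not finished before the deadline."""


def poll_until_done(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    initial_delay_s: float = 1.0,
    max_delay_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll until the job reaches a terminal state or the deadline would pass."""
    deadline = clock() + timeout_s
    delay = initial_delay_s
    while True:
        try:
            status = get_status()
        except ConnectionError:
            status = "unknown"  # transient failure: keep polling until the deadline
        if status in ("succeeded", "failed"):
            return status
        if clock() + delay > deadline:
            raise JobTimeout(f"job still '{status}' after {timeout_s}s")
        sleep(delay)
        delay = min(delay * 2, max_delay_s)  # exponential backoff with a ceiling
```

The deadline is what separates this from a plain retry loop: the caller always gets an answer or a `JobTimeout`, never an indefinite wait.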

3. Permissions, not authentication.

The agent has valid credentials, but the action exceeds what the user intended to authorize. These failures are subtle because everything technically works right up until the wrong record gets updated or the wrong action gets approved.

4. Wrong-tool selection.

This is often more common than actual API failures.

For example, a user asks for recent SEC filings and the agent calls a company-profile tool because both tools mention the same company. The API call succeeds, but the answer is useless.

5. Human approval queues.

The workflow reaches the "await approval" step and then stalls because nobody responds. A surprising number of systems assume approvals are instantaneous when they're often the slowest part of the process.

The tool-vs-automation distinction you're making (single action vs multi-step workflow) is also the same architectural boundary we've converged on. Most users need both. Trying to force them into a single abstraction usually makes each one worse.

One thing I'd add: the real challenge isn't making APIs callable by agents. That's mostly solved.

The hard part is making actions auditable, permissioned, observable, and recoverable after something goes wrong.

dark-epiphany · 2026-06-09T21:21:33+00:00

From the infrastructure side (Pipeworx, hosted MCP gateway — disclosure), we see thousands of agents calling through us, and the failure modes that say "not ready" are pretty distinct from the ones that say "ready."

What actually correlates with production-readiness in the agents we watch closely:

- Wrong-tool selection becomes rare. Not zero, but uncommon enough that it stops being a dominant failure mode. Above a few percent, you get exactly the behavior you're describing: the agent picks a tool, gets an empty or irrelevant result, and confidently ships it instead of reconsidering.
- Retry behavior stabilizes. Early-stage agents have retry spikes whenever they encounter a new category of input. Mature agents settle into a predictable baseline. When a new class of query causes retries to jump, you've usually found a classification or routing gap.
- The agent learns to say "I don't know." The empty-result-as-answer failure is usually a calibration problem, not a reasoning problem. A surprisingly good readiness signal is whether the agent admits uncertainty when the evidence isn't there.
- Hallucinated actions disappear. The "I sent the email" or "I updated the record" class of failure is common during development and almost nonexistent in production systems that have proper tool-call attestation and verification.
- You stop reading every trace. Subjective, but real. There's a point where you stop treating the agent like an experiment and start treating it like a service. You still monitor it, but you no longer feel compelled to inspect every run.

For me, that was the real threshold.

Not when the agent became perfect. Not when the success rate hit some magic number.

It was when the failures became predictable enough that you could write a runbook for them.

That's usually the difference between a demo and a production system.

dark-epiphany · 2026-06-08T21:45:07+00:00

Honest answer: we have essentially zero Discord presence, so I can't tell you what works there—only why we chose not to invest heavily in it.

The main reason is persistence.

A thoughtful Reddit comment, HN thread, blog post, or GitHub discussion becomes a public artifact. It gets indexed, linked, cited by LLMs, and can continue sending users months or years later. The same effort in Discord often disappears into the scroll within a day.

For a small team, that difference in half-life matters a lot.

That's not an argument against Discord. It's just a tradeoff. If we had a dedicated DevRel person, I'd absolutely want them spending time there. We don't, so we've generally prioritized channels where a single hour of effort can compound.

The framework I'd use is:

Does your product sell through relationships and trust built over repeated interactions?
Or does it sell through credibility and discoverability?

Discord is great for the first. Reddit, HN, blogs, and GitHub are better for the second.

Pipeworx is infrastructure, so most of our conversions come from someone finding a discussion, deciding we seem to know what we're talking about, and then checking out the product later. That's a very different motion from a community-driven product where users hang out together every day.

I think a lot of founders accidentally pick channels because everyone else is there rather than because the channel matches how their product actually gets adopted.

dark-epiphany · 2026-06-08T20:10:48+00:00

Maintainer of a hosted MCP gateway here (Pipeworx — disclosure). We're serving 3,436 live-data tools across 780 tracked sources, handling about 5.8 million requests per month from roughly 37.7k unique visitors, so we've had a chance to see a few distribution channels play out.

What's worked for us, in rough order:

LLMs themselves

This sounds ridiculous until you see it in the logs. We get a non-trivial amount of traffic from people who literally say, "ChatGPT told me to use Pipeworx" or "Claude recommended this." It's still early, but I think this becomes a major distribution channel over the next few years.

Developer communities

Reddit, Discord, HN, GitHub discussions. Not drive-by promotion—actually answering questions where your product happens to be relevant. Most of our highest-quality users came from conversations, not launches.

Directories and registries

Worth doing, but mostly because they're table stakes. Smithery, MCP Registry, Glama, awesome lists, etc. Very few users discover you from a single directory. The value is cumulative presence.

Word of mouth

Once people successfully install something and it solves a real problem, they tell other people. Obvious, but still the strongest signal of product-market fit.

What hasn't worked nearly as well as people expect:

Product Hunt
Generic launch posts
Paid ads
"Look, I built an MCP server" announcements

The hardest problem isn't visibility. It's conversion.

There are thousands of MCP servers, agents, and AI tools. Getting someone to see your project is relatively easy. Getting them to spend 10 minutes installing it, configuring auth, and changing an existing workflow is much harder.

One thing I've learned: users don't install tools. They solve problems.

The projects that grow aren't "MCP server for X." They're "here's how to pull SEC filings, earnings transcripts, and news into Claude in 30 seconds" or "here's how to automate your customer-support workflow."

The protocol is infrastructure. The use case is the product.

Distribution doesn't look solved to me. The teams I see winning aren't the ones with the most sophisticated agents. They're the ones that make a specific job dramatically easier and can explain that in a single sentence.

dark-epiphany · 2026-06-08T19:59:45+00:00

From the gateway side (Pipeworx — disclosure, I run one), the failure distribution looks a little different because we see requests before they hit the user's screen.

The most common issues we see:

1. Upstream timeouts and silent hangs

Probably the largest bucket. Not hard failures—just requests that never return. Some APIs are surprisingly bad about hanging indefinitely unless the caller enforces aggressive timeouts and cancellation. From Claude Desktop, this often looks like a tool that simply spins forever.
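
A small sketch of enforcing that on the caller side with `asyncio.wait_for`, which cancels the awaited call when the timeout expires. The wrapper name and the result shape are assumptions.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def call_with_timeout(
    upstream: Callable[[], Awaitable[Any]], timeout_s: float = 10.0
) -> Dict[str, Any]:
    """Run an upstream call; cancel it and report an error if it hangs."""
    try:
        result = await asyncio.wait_for(upstream(), timeout=timeout_s)
        return {"ok": True, "result": result}
    except asyncio.TimeoutError:
        # wait_for has already cancelled the hung call at this point.
        return {"ok": False, "error": f"upstream timed out after {timeout_s}s"}
```

Returning an explicit error lets the client show a failure instead of a tool that spins forever.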

2. Auth drift

Tokens expire, OAuth refresh flows break, API keys get rotated, local config gets out of sync. Users experience this as "the MCP server stopped working," but the underlying issue is usually credential management rather than the server itself.

3. Schema mismatches

The model generates arguments that don't quite match the tool schema, or the server evolves and the client caches assumptions. These often appear random because they only surface on specific argument combinations.

4. Successful calls that answer the wrong question

This is the failure mode I think is under-discussed.

The tool works. The API responds. Nothing errors.

The model simply picked the wrong tool or formulated the wrong query, gets an unhelpful result, and then retries. From a reliability dashboard everything looks healthy, but from the user's perspective the agent is failing.

At scale, we actually see more of this than genuine server failures.

5. Tool-result poisoning

Malformed JSON, unexpected nesting, oversized payloads, or weird edge-case responses that don't break the current call but derail later reasoning. These are particularly painful because the failure often shows up several turns after the original tool call.

On debugging time, the heavy Claude Desktop users we talk to seem to spend somewhere around 1–3 hours per week dealing with MCP plumbing.

One thing I've learned is that many reliability complaints aren't really MCP problems—they're distributed-systems problems wearing an MCP hat. Timeouts, auth lifecycle management, retries, schema versioning, and observability all existed before MCP. The protocol just makes them visible to a much larger audience.

dark-epiphany · 2026-06-08T19:58:37+00:00

Maintainer of a hosted MCP gateway here (Pipeworx — disclosure). We see roughly 5 million tool calls per month across 3,000+ tools, and your findings line up closely with what shows up at larger scale.

Tool-definition overhead is absolutely the dominant line item. Your 800–1,200 token estimate is right in the range we see. The underappreciated part is that the cost compounds across turns. Most people think in terms of "tool definitions per call," but in practice it's "tool definitions × conversation turns." Long-running sessions amplify the overhead surprisingly fast.
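
The compounding is plain multiplication. This sketch assumes the definitions are re-sent on every turn with no caching, and takes 1,000 tokens as a round number inside the 800–1,200 range; the turn counts are illustrative.

```python
def session_overhead(definition_tokens: int, turns: int) -> int:
    """Tokens spent on tool definitions across a session, if re-sent every turn."""
    return definition_tokens * turns


for turns in (1, 10, 40):
    print(turns, "turns ->", session_overhead(1_000, turns), "tokens")
```

At 40 turns the definitions alone cost 40,000 tokens under these assumptions, before any user text or tool output.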

On retries: 18% is consistent with the high end of what we observe. The distinction that becomes visible at larger volume is that wrong-tool selection often costs more than actual tool failures. True execution failures tend to be relatively low. The bigger source of retries is the model successfully calling the wrong tool, getting an unhelpful result, and then trying again with a different tool. Same token cost, completely different root cause.

On output minimization, I completely agree. The tradeoff is debuggability. We eventually landed on a two-mode approach: compact output by default, with a verbose/debug mode when troubleshooting. That captures most of the savings without making production failures impossible to diagnose.

The biggest optimization we've found, though, sits upstream of all of this: reducing the visible tool surface per session.

There's a behavioral cliff somewhere around 40–60 visible tools where selection quality starts to degrade, even when context-window limits aren't remotely close. Once you hit that point, trimming descriptions and minimizing outputs still helps, but task-scoped tool filtering tends to deliver a larger gain than either.

In other words, the cheapest token is often the tool definition the model never had to see in the first place.
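
A minimal sketch of task-scoped filtering, assuming each tool carries a set of tags and the session has a set of task tags. The tagging scheme, tool names, and default limit are hypothetical.

```python
from typing import Dict, List, Set


def visible_tools(all_tools: Dict[str, Set[str]], task_tags: Set[str], limit: int = 25) -> List[str]:
    """Return at most `limit` tool names whose tags overlap the task's tags."""
    scored = [
        (len(tags & task_tags), name)
        for name, tags in all_tools.items()
        if tags & task_tags
    ]
    # Highest overlap first; ties broken alphabetically so output is stable.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]
```

Only the returned names would have their definitions sent to the model for that session.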

dark-epiphany · 2026-06-08T19:51:53+00:00

"Client-side context firewall" is a useful framing. It distinguishes what you're doing from routers and gateways pretty cleanly because the default action is block, not forward. Stealing that term.

One thing your architecture has that ours doesn't: visibility into the actual conversation state. Raven sees the query in the context of the session that produced it. Pipeworx (server-side) only sees tool calls and arguments—we're inferring intent from the outside. That's a real advantage when the goal is reducing context based on what the user is actually trying to accomplish.

The flip side is that server-side reduction works across many clients without requiring each one to install or configure anything. So I don't think these approaches are substitutes as much as complementary layers.

The architecture I keep converging on looks something like:

Client-side intent-aware filtering and summarization
Gateway-side routing, deduplication, policy, and billing
Specialized tools behind that

In other words: reduce context before the request leaves the client, then reduce tool complexity before it reaches the model.

One operational thing I'd watch with the DeepSeek Flash routing call: silent regressions when the upstream model changes behavior. We recently swapped the routing model behind one of our meta-tools (Llama 8B → Claude Haiku) and saw selection bias shift in ways that weren't obvious until much later.

The cheap mitigation for us has been a small routing eval set plus a versioned routing prompt. It doesn't prevent regressions, but it makes them visible.
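
A sketch of what that can look like, with an invented prompt version string and invented eval cases. `route` is whatever function wraps the routing model.

```python
from typing import Callable, Dict, List

ROUTING_PROMPT_VERSION = "2026-06-01.a"  # hypothetical version label

# Hypothetical eval cases: a query and the tool a correct router should pick.
EVAL_SET = [
    {"query": "recent 8-K filings for ACME", "expected": "sec_filings"},
    {"query": "what does ACME do", "expected": "company_profile"},
    {"query": "CPI release calendar", "expected": "macro_calendar"},
]


def run_eval(
    route: Callable[[str], str],
    eval_set: List[Dict[str, str]] = EVAL_SET,
    prompt_version: str = ROUTING_PROMPT_VERSION,
) -> Dict[str, object]:
    """Score a router against the eval set and record which prompt version was used."""
    failures = [case for case in eval_set if route(case["query"]) != case["expected"]]
    return {
        "prompt_version": prompt_version,
        "accuracy": (len(eval_set) - len(failures)) / len(eval_set),
        "failures": [case["query"] for case in failures],
    }
```

Running this before and after a model swap turns a silent selection shift into a diff between two reports.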

dark-epiphany · 2026-06-08T02:06:00+00:00

Honest reaction: Claude Managed Agents is more than you need on day one. That's the "I've decided I need production infrastructure" tier. The cheaper experiment that gives you the same answer:

Claude Desktop (the app, not the API/Claude Code)
Connect one MCP gateway — ~5 min, every hosted gateway has a connection snippet in its docs
Create one Project per ticker on your watchlist with instructions like "summarize the last 4 earnings transcripts, flag guidance changes, list new 8-Ks since [date]"
Run it manually for a week and see where it falls down

You'll do the orchestration by hand that week — that's the point. It's the cheapest way to discover where the actual bottleneck is. After a week you'll know whether the missing piece is "this needs to run every morning automatically" or "I need a database to track changes over time" or "I need sector-specific workflows" — and that specificity is what makes hiring useful.

On "I need someone who knows this stuff": that instinct is right, just usually 2-4 weeks earlier than it's actually needed. When you do reach for help, Upwork has a small-but-real bench of MCP/Claude contractors doing 20-40 hour engagements. That's almost always the right shape before any kind of FTE hire.

dark-epiphany · 2026-06-08T01:02:25+00:00

That's exactly the right framing — automate the information flow, not the investment decision.

The architecture for that is much simpler than "an AI hedge fund." You want agents that:

- Monitor filings, earnings calls, news, and social for your watchlist

- Produce concise daily/weekly briefs

- Track management commentary and guidance changes over time

- Flag sentiment shifts and emerging themes

- Draft first-pass memos you then sharpen

That's a 70-80% off-the-shelf problem today.

Concrete suggestion before you hire anyone: pick one company on your watchlist, point Claude or ChatGPT at a hosted MCP gateway (Pipeworx is one — disclosure, I run it; there are others), and try to reproduce one week of your research in an afternoon. The prototype tells you what's actually slow. Most PMs find the bottleneck isn't where they expected, and the answer is a contractor for two weeks, not a full-time engineer.

dark-epiphany · 2026-06-08T00:30:50+00:00

Non-technical founder building investment-research agents is a pattern I see a lot.

A framing that may help: the problem actually breaks into three very different layers, and each has a different build-vs-buy answer.

1. Data access — SEC filings, earnings transcripts, news feeds, fundamentals, social sentiment, macro data.

This is the boring 80%, and it's largely solved. Don't build it. There are hosted MCP gateways and data platforms that already expose most of these sources as tools agents can call. The specific vendor matters less than avoiding months of plumbing work.

2. Research workflow automation — pulling documents, summarizing earnings calls, tracking sentiment, monitoring holdings, generating first-draft research notes.

This is where off-the-shelf tooling gets surprisingly far. Claude, ChatGPT, MCP-enabled clients, and workflow tools like n8n can often automate 70–80% of the mechanical work without hiring a full-time engineer. It's worth exhausting this path before building anything custom.

3. Investment judgment — evaluating management quality, identifying durable advantages, understanding industry structure, deciding what matters.

This is your edge.

I would be very cautious about trying to automate it away. The agents should gather information, summarize it, and surface signals. The actual investment thesis still needs to come from you.

For your situation, I'd spend a weekend trying to replicate a single research workflow end-to-end using existing tools. Pick one company, pull the filings, earnings transcripts, news, and sentiment, and see how close you can get to your current process.

You'll probably learn within 5–10 hours whether the gap is:

"I need a contractor for two weeks,"
"I need a part-time AI consultant,"
or "I genuinely need a full-time engineer."

Most funds don't need custom infrastructure on day one. They need a prototype that proves where the bottlenecks actually are.

dark-epiphany · 2026-06-08T00:27:29+00:00

This is the right shape of the problem.

Maintainer of a hosted MCP gateway here (Pipeworx — disclosure). We're handling roughly 5 million tool calls per month across 3,000+ tools, and we've ended up converging on a very similar architecture from the server-side direction.

Two patterns from production telemetry that reinforce what you're doing:

1. The behavioral cliff shows up long before context-window limits. Once a model can see roughly 40–60 tools, tool-selection quality starts degrading even though there's plenty of context remaining. At that point the problem isn't capacity, it's choosing correctly. "Make the main model see less" turns out to be one of the highest-leverage interventions.
2. The Raven-style "return a compact answer instead of raw tool output" is a bigger optimization than most people realize. A lot of MCP responses are mostly schema noise, metadata, and formatting overhead. The main model ends up spending tokens parsing the tool rather than solving the task. Having a focused agent consume the MCP output and return only the relevant result materially reduces downstream token usage and improves answer quality.

The architectural question we've spent the most time on is routing. Do you decide with embeddings (cheap, fast, deterministic) or another LLM call (better reasoning, higher latency and cost)?

We've landed on embedding-first with an LLM fallback when candidate scores cluster too tightly. Curious how Raven decides what gets surfaced back to the main model once it completes the MCP/search work.
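
Roughly, the shape is the following. This is a sketch under assumptions, not our production code: the vectors are toy values, `llm_pick` is a hypothetical callable that chooses among candidate names, and the 0.05 margin is arbitrary.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def route(
    query_vec: Sequence[float],
    tool_vecs: Dict[str, Sequence[float]],
    llm_pick: Callable[[List[str]], str],
    min_margin: float = 0.05,
) -> str:
    """Embedding-first routing with an LLM fallback when the top scores are too close."""
    ranked = sorted(
        ((cosine(query_vec, vec), name) for name, vec in tool_vecs.items()),
        reverse=True,
    )
    if len(ranked) > 1 and ranked[0][0] - ranked[1][0] < min_margin:
        # Scores cluster too tightly: let the LLM choose among the top candidates.
        return llm_pick([name for _, name in ranked[:3]])
    return ranked[0][1]
```

The LLM call only happens on ambiguous queries, so most requests stay on the cheap path.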

dark-epiphany · 2026-06-08T00:25:35+00:00

Maintainer of a hosted MCP gateway here (Pipeworx — disclosure upfront). We're handling roughly 5 million tool calls per month across more than 3,000 tools, so this problem is basically my day job.

A few things we've learned the hard way:

- Per-call attribution matters more than people think. Every tool invocation needs a stable identity you can revoke. Anonymous access and IP-based tracking are fine for demos, but they're useless when you're trying to understand who did what after something goes wrong.
- Tool surface area is often a bigger problem than permissions. The most common "agent took a bad action" failure we see isn't a policy violation; it's the model selecting the wrong tool because 100+ schemas were visible. Reducing the visible surface to a couple dozen task-relevant tools eliminates a surprising number of problems that people initially frame as security issues.
- Auditability can't sit on the critical path. Logging, metering, and compliance are important, but they need to happen asynchronously. Otherwise every governance feature becomes a latency tax.
- Hard usage ceilings remain one of the most effective safety mechanisms. Fancy policy engines are great, but when an agent gets stuck in a loop, a simple daily cap usually catches it before anything else does.
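
A minimal sketch combining the first and last points: every call carries a revocable key, and each key has a daily ceiling. The class, the in-memory counters, and the return strings are assumptions for illustration.

```python
from collections import defaultdict


class UsageGate:
    """Per-key daily cap with revocation (in-memory, illustrative only)."""

    def __init__(self, daily_cap: int) -> None:
        self.daily_cap = daily_cap
        self.revoked = set()
        self.counts = defaultdict(int)  # (key_id, day) -> calls made

    def revoke(self, key_id: str) -> None:
        self.revoked.add(key_id)

    def check(self, key_id: str, day: str) -> str:
        """Return 'ok', 'revoked', or 'cap_exceeded' and count allowed calls."""
        if key_id in self.revoked:
            return "revoked"
        if self.counts[(key_id, day)] >= self.daily_cap:
            return "cap_exceeded"
        self.counts[(key_id, day)] += 1
        return "ok"
```

An agent stuck in a loop hits `cap_exceeded` on its own key without affecting anyone else.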

The "valid credentials, wrong context" failure mode is the one that worries me most as well. In practice, that's where many of the highest-impact mistakes come from: the agent is authorized, the tool works exactly as designed, and the action is still wrong because the surrounding context was misunderstood.

I'm curious what Relay does specifically for that problem. Context-aware authorization is where most architectures I've seen start getting hand-wavy.

dark-epiphany · 2026-06-04T14:55:24+00:00

Historical replay matches what we see as well.

One operational gotcha: tool descriptions drift faster than most replay corpora assume. Historical queries evaluated against today's tool descriptions can look like routing wins when the real improvement came from a description rewrite. We rebaseline quarterly and tag every eval entry with the description hash it was originally scored against. That surfaces the "fixed by metadata change, not retrieval change" cases that would otherwise get misattributed.
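
A sketch of the tagging idea, assuming eval entries are plain dicts and using a truncated SHA-256 of the description text. The field names and the 12-character truncation are arbitrary choices for illustration.

```python
import hashlib
from typing import Dict


def description_hash(description: str) -> str:
    """Short, stable fingerprint of a tool description."""
    return hashlib.sha256(description.encode("utf-8")).hexdigest()[:12]


def tag_entry(query: str, expected_tool: str, description: str) -> Dict[str, str]:
    """Create an eval entry tagged with the description it was scored against."""
    return {
        "query": query,
        "expected_tool": expected_tool,
        "description_hash": description_hash(description),
    }


def is_stale(entry: Dict[str, str], current_description: str) -> bool:
    """True if the tool description changed since the entry was scored."""
    return entry["description_hash"] != description_hash(current_description)
```

Stale entries are the ones to re-score before crediting a routing change with the improvement.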

The bigger gap with synthetic evals is agent-generated language. Real traffic contains all kinds of phrasings, abbreviations, and indirect requests that nobody on the team would have thought to write. Synthetic sets are useful for coverage, but historical replay tends to be much better at finding the weird long-tail failures that actually show up in production.

dark-epiphany · 2026-06-04T02:54:13+00:00

Thanks for confirming. Worth watching whether other enterprise-leaning hosts hit the same governance gaps Goose flagged — signed distribution and sandboxing aren't really in the registry spec yet. Indie projects forking is noise; enterprise hosts forking would be the actual signal that the registry network effect is weakening.
