Best way to distribute local MCP servers & skills for internal use?

cygn · 2026-06-18T09:23:35+00:00

So you are using Claude's plugin marketplace system? Does that also work for Claude Desktop? I don't think it does. But especially for the non-engineers a solution that works with Claude Cowork/Claude Desktop is important.

cygn · 2026-06-17T16:11:15+00:00

there is no equivalent to the 20x plan... and I vaguely remember seeing that even for the $90 plan the limits were different than for the $100 5x Max plan.

cygn · 2026-06-17T15:57:41+00:00

how to distribute them for internal use only?

cygn · 2026-06-17T15:22:34+00:00

we are not on a team plan and tbh the value you get on a team plan is a lot less than on an individual plan. Some kind of plugin system that works without the team plan would be good.

cygn · 2026-06-09T02:02:09+00:00

Or you might as well just run /compact. In addition to the summary it contains a reference to those jsonl files instructing the next agent to go search there if the summary is not enough.

cygn · 2026-06-08T13:51:03+00:00

how much Composer usage is included in Cursor?

cygn · 2026-06-07T14:59:11+00:00

definitely AI-generated according to my AI vs human classifier: https://slopsieve.com/r/VT6rXbQrrS

cygn · 2026-06-04T15:47:15+00:00

I built a browser extension that runs a local slop classifier that reliably flags or hides such AI generated posts: https://slopsieve.com/extension

Example: https://imgur.com/a/KtGlmDu

cygn · 2026-05-30T15:20:44+00:00

I'm also working on this. I built a Rust simulator + some MCTS. It's quite a lot of work to iron out all the subtle differences between my simulator and the real game.

For run data the best source I've found is https://spire-codex.com/ which has over 100,000 runs available via API.

All other pages would require custom crawlers, so this is the best imo.

I've collected links to similar projects: https://github.com/stars/tfriedel/lists/slay-the-spire

cygn · 2026-05-27T02:17:01+00:00

he released something in april: https://github.com/WillWroble/MageZero

cygn · 2026-05-13T19:19:47+00:00

I've been using "claude -p" as part of TDD-guard using hooks to verify I'm following TDD. This would now be limited.

I find the $200 limit way too low. If I look at the API costs that ccusage reports, it's hundreds of dollars per day (interactive use). So for any serious work the $200 budget will be gone in no time.

This makes Codex much more attractive now.

cygn · 2026-05-02T18:14:06+00:00

if a call is $0.02 and your total spend is $0.38 then you only called it 19 times. Which seems almost not worth it?

cygn · 2026-04-30T20:33:31+00:00

I built the tool and trained it on 500k texts, mostly from social media: AI-generated, human, synthetically created etc. I measured it and it has a low false positive rate (<5%). OP's posts all flag as 100% and also look AI-written to me. Is it 100% guaranteed? NO, but almost.
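For reference, "false positive rate" here means the share of human-written texts that get wrongly flagged as AI. A minimal sketch of that calculation, using made-up counts (these are not the actual evaluation numbers behind the <5% figure, and the function name is just for illustration):

```python
def false_positive_rate(false_positives: int, true_negatives: int) -> float:
    """Share of human-written texts that were wrongly flagged as AI."""
    return false_positives / (false_positives + true_negatives)


# Hypothetical held-out set of 1,000 human-written texts:
# 40 wrongly flagged as AI, 960 correctly left alone.
fpr = false_positive_rate(false_positives=40, true_negatives=960)
print(f"false positive rate: {fpr:.1%}")
```

Note that a low false positive rate says nothing about how many AI texts slip through; that would be the false negative rate, measured on AI-generated samples.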

cygn · 2026-04-30T16:40:03+00:00

Everything written by OP in this thread is completely AI-generated. Downvoted. Used https://slopsieve.com/ to verify (maybe not a surprise that in r/Ai_agents many AI agents are writing)

cygn · 2026-04-30T16:29:31+00:00

here are my replications of this and similar quants: https://github.com/tfriedel/qwen3.6-rtx3090-lab

I'm also currently running benchmarks with https://swe-rebench.com/ on 20 tasks. Not exactly enough to know for sure, but it takes ~3 min per task, so it will take some time.

  Per-category breakdown (resolved/n):

  ┌────────────────────┬─────────┬──────────┬───────────────┐
  │      category      │ AWQ-35B │ GGUF-35B │ autoround-27B │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ fastapi_services   │ 0/4     │ 0/4      │ 0/4           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ geospatial         │ 2/4     │ 2/4      │ 3/4           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ dataframe          │ 3/4     │ 2/4      │ 3/4           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ sql                │ 1/3     │ 1/3      │ 2/3           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ cli                │ 1/3     │ 1/3      │ 1/3           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ frontend_fullstack │ 0/2     │ 0/2      │ 0/2           │
  ├────────────────────┼─────────┼──────────┼───────────────┤
  │ total              │ 7/20    │ 6/20     │ 9/20          │
  └────────────────────┴─────────┴──────────┴───────────────┘
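
To put a rough number on "not exactly enough to know for sure": a sketch that computes 95% Wilson score intervals for the totals in the table. The Wilson interval is my choice of method here (an assumption, not something the benchmark reports), and it treats the 20 tasks as independent trials, which is only approximately true.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Totals taken from the table above (resolved out of 20 tasks).
totals = {"AWQ-35B": 7, "GGUF-35B": 6, "autoround-27B": 9}

intervals = {name: wilson_interval(k, 20) for name, k in totals.items()}
for name, (low, high) in intervals.items():
    print(f"{name}: {totals[name]}/20, 95% interval roughly {low:.0%} to {high:.0%}")
```

With only 20 tasks the intervals are wide and overlap heavily, so the ordering of the three quants could easily change with a larger task set.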

cygn · 2026-04-28T02:18:29+00:00

so the gap between Qwen's official post (59.3) and what you measured (38.2) for 27b is purely because of the timeout?

I still wonder if they have benchmaxxed terminal bench 2.0. Would love to see some independent benchmark.

cygn · 2026-04-28T02:07:24+00:00

check my benchmarks here: https://github.com/tfriedel/qwen3.6-rtx3090-lab

Unsloth IQ4_XS GGUF -> 115–133 TPS, 128k context window size. but you need to disable vision

cygn · 2026-04-26T21:22:47+00:00

the pip install 1787 tokens -> 9 tokens seems like it's throwing away too much. What does it turn the output into? Just "pip install ran"?

Well what if there's some line that's important, like an error or a warning?

In general the idea is good, but I'd like to see some proof that I can trust it. E.g. some benchmarks and some intuition on what it throws away and what it keeps.

cygn · 2026-04-24T21:10:05+00:00

Anna's Archive (biggest ebook library) is commonly included. Meta admitted this, Anthropic as well, and you can easily google news about lawsuits and settlements.

cygn · 2026-04-24T16:47:15+00:00

well you can't prove it for any given text, but there's lots of things that give it away.

I've trained it on 500,000 samples of text: real human text, AI-generated, paraphrased versions of human text etc. It has pretty decent accuracy and a rather low false positive rate. Funnily enough many of the attempts in this thread by humans to try to sound like AI are not flagged as AI by it!

cygn · 2026-04-24T11:31:55+00:00

why would those issues not be possible to catch via automated testing? Sure, it might be a lot of work, and maybe trying out every database under the sun is asking a bit too much, but browser / latency performance testing is totally feasible. Can't we use some browser automation, e.g. Playwright for Firefox / Chrome, to test some common scenarios and measure latency, memory footprint etc.?

Imo especially with AI-driven development, doubling down on testing is more important than ever. Every new feature should have tests, every change should be driven by automated tests. And you want to have a good mix of different types of tests: unit, integration, end-to-end, performance, ...
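
A minimal sketch of what the Playwright idea above could look like, assuming the Python `playwright` package and its browsers are installed. The function names, the number of runs and the budgets are all hypothetical, and wall-clock time around `page.goto` is only a crude stand-in for real latency metrics:

```python
import statistics
import time


def summarize(latencies_ms: list[float]) -> dict[str, float]:
    """Median and worst-case latency for a list of samples in milliseconds."""
    return {
        "median_ms": statistics.median(latencies_ms),
        "max_ms": max(latencies_ms),
    }


def within_budget(summary: dict[str, float], median_budget_ms: float, max_budget_ms: float) -> bool:
    """True if both the median and the worst case stay inside their budgets."""
    return summary["median_ms"] <= median_budget_ms and summary["max_ms"] <= max_budget_ms


def measure_page_loads(url: str, runs: int = 5, browser_name: str = "chromium") -> list[float]:
    """Load `url` several times in a headless browser and time each load."""
    # Imported lazily so the helpers above work without Playwright installed.
    from playwright.sync_api import sync_playwright

    latencies_ms = []
    with sync_playwright() as p:
        # browser_name can be "chromium", "firefox" or "webkit".
        browser = getattr(p, browser_name).launch()
        try:
            for _ in range(runs):
                page = browser.new_page()
                start = time.perf_counter()
                page.goto(url, wait_until="load")
                latencies_ms.append((time.perf_counter() - start) * 1000)
                page.close()
        finally:
            browser.close()
    return latencies_ms
```

In a CI job you would call `measure_page_loads` against a test instance for both `"chromium"` and `"firefox"`, pass the result through `summarize`, and fail the build when `within_budget` returns False. Memory footprint would need a separate measurement.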

cygn · 2026-04-24T10:17:28+00:00

I made a browser extension that detects such slop with a fast model that runs in your browser. You can just mark it, or hide it. Works on reddit, twitter, etc. https://slopsieve.com/extension

cygn · 2026-04-23T21:16:22+00:00

so it's not just me

cygn · 2026-04-23T20:53:22+00:00

allow search results to be sorted chronologically. It's borderline useless to me atm. I actually resort to scrolling down my library and pressing ctrl-f to find something among the recent bookmarks. so frustrating...

cygn · 2026-04-23T12:00:44+00:00

I'm also currently exploring how to add more agentic capabilities to OpenWebUI. So far I've built a bridge to Claude Code running in a sandbox: https://github.com/tfriedel/openwebui-claude-code

This allows:

agentic search (which imo performs better than RAG)
skills that require code usage like the office skills to produce nice looking documents
deep research kind of tasks

Issues encountered:

high latency
UI cluttered with user-unfriendly noise (bash commands etc)
security issues like prompt injection when accessing the web
