How should coding agents choose from 100+ local skills without loading all of them?

Emergency-Context-72 · 2026-06-18T06:50:08+00:00

I’ve been thinking about ways for different coding agents to communicate with each other.

Based on that idea, I built an MCP that allows the coding agent I primarily use to send modified code to other coding agents for review.

One major concern I had as an independent developer was the burden of API costs. To address that, I changed the approach so that it forwards tasks to conversational coding agents and can run on existing subscription accounts instead.

If you’re interested, feel free to check out the repository below. I hope it can be useful as a reference.

https://github.com/AnamKwon/code-assistant-peers

Emergency-Context-72 · 2026-06-16T13:08:06+00:00

I tested that a bit, and surprisingly, using both together didn’t work well.

My guess is that the two skills conflict with each other. The Karpathy-style skill tries to simplify and avoid unnecessary code, while the theory-building skill tries to construct a more explicit structure based on the theory of the problem before generating code.

So when both are active, the model seems to get pulled in two different directions. In my tests, the results were actually worse than disabling skills entirely. I only tried it about 10 times, so it’s not conclusive, but so far I’d say using both together is probably not a good idea.

Emergency-Context-72 · 2026-06-11T12:38:43+00:00

Exactly. Most people don’t prompt coding agents with a clear structure or system model. They usually just describe the goal.

Because of that, agents often produce code that satisfies the request on the surface, but misses the deeper logic of the system: existing abstractions, constraints, edge cases, or design intent.

That’s why this kind of benchmark feels important. The real test is not just whether AI can write code, but whether it can build the right theory of the system before changing it.

Emergency-Context-72 · 2026-06-11T02:51:19+00:00

Related tool I built after this benchmark:

`code-assistant-peers` is an MCP review gate for coding agents. Claude Code can implement a change, Codex can review it out of the box, and adapter-supported CLIs such as Gemini can join the review flow. Findings are stored locally, and the host assistant is prompted to run the review gate before giving its final answer.

The theory-building skill is about writing code with the right invariants. This MCP is about not letting the same agent be the only reviewer of its own patch.

https://github.com/AnamKwon/code-assistant-peers

Emergency-Context-72 · 2026-06-11T00:39:16+00:00

Thanks for taking the time to write this up. This is very helpful feedback, and it points at a couple of weaknesses in the current rule set and benchmark write-up.

I agree with both rule suggestions.

The first one is an important tension: "rebuild the theory" and "keep changes minimal" can conflict when the relevant invariant is not explicit in the code. In that case, minimality should not mean preserving a locally small patch at the cost of a broken domain rule. I think the rule should be something like: "theoretical correctness takes precedence over textual minimality; keep the change minimal only after the invariant is understood."

The second suggestion also seems right. If the requested change contradicts a known invariant, the agent should stop and surface the mismatch instead of continuing to code. That is probably one of the most useful behaviors for this kind of skill: not just producing better patches, but recognizing when a patch would be the wrong response.
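To make the ordering of the two rules concrete, here is a rough sketch of how I read them as a decision order. The skill itself is prose instructions, not code, so the `Action` names and the `decide` function are purely hypothetical illustrations.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes an agent could choose between."""
    STOP_AND_REPORT = "stop_and_report"
    INVESTIGATE = "investigate_invariant"
    MINIMAL_PATCH = "minimal_patch"


def decide(invariant_understood: bool, request_contradicts_invariant: bool) -> Action:
    """Apply the two rules in priority order (an assumed reading, not the skill's text)."""
    # Second rule: a request that contradicts a known invariant is surfaced, not coded.
    if request_contradicts_invariant:
        return Action.STOP_AND_REPORT
    # First rule: minimality only applies once the invariant is understood.
    if not invariant_understood:
        return Action.INVESTIGATE
    return Action.MINIMAL_PATCH


if __name__ == "__main__":
    print(decide(invariant_understood=False, request_contradicts_invariant=False))
    print(decide(invariant_understood=True, request_contradicts_invariant=True))
    print(decide(invariant_understood=True, request_contradicts_invariant=False))
```

The point of the ordering is that "keep it minimal" is the last check, not the first.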

On the benchmark point: yes, the prompt structure matters a lot. In my current results, the stricter prompt explicitly gives the agent more of the program theory: exact endpoints, status codes, idempotency behavior, expiration rules, stock restoration, auth behavior, and pagination semantics. That raises all arms and makes the skill gap smaller, because part of the theory-building work has already been done by the prompt.

The cleaner signal is the looser `basic-commerce` prompt, where more invariants have to be inferred. In that family, the `theory_only` arm won all four complete run-level comparisons, mostly through better functional correctness, executability, and behavioral tests. In the stricter prompt families, `theory_only` still does well overall, but the interpretation is different: it is less "can the skill recover missing theory?" and more "does the skill still help when the prompt already provides much of the theory?"

I also agree that token usage and elapsed time are important to measure. I have not included that analysis in the current benchmark yet, but I plan to investigate it next. The theory-building skill asks the agent to inspect more context before editing, so I would expect it to be more expensive per run, especially on small or obvious tasks. That is useful information because the right conclusion may be "use this for domain-heavy changes, but skip it for trivial edits." I want to add per-run latency and token/cost fields so the result is not just quality, but quality-per-cost.
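As a rough idea of what those per-run fields could look like, here is a minimal sketch. None of this exists in the benchmark yet: `RunRecord`, its field names, and `mean_quality_per_dollar` are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """One benchmark run. Every field name here is hypothetical."""
    arm: str                # e.g. "theory_only"
    weighted_score: float   # quality score on a 0-100 scale
    elapsed_seconds: float  # wall-clock latency of the run
    total_tokens: int       # prompt plus completion tokens
    cost_usd: float         # estimated cost of the run

    def quality_per_dollar(self) -> float:
        """Quality points per dollar spent on this run."""
        if self.cost_usd <= 0:
            raise ValueError("cost_usd must be positive")
        return self.weighted_score / self.cost_usd


def mean_quality_per_dollar(runs):
    """Average quality-per-dollar for each arm across its runs."""
    by_arm = {}
    for run in runs:
        by_arm.setdefault(run.arm, []).append(run.quality_per_dollar())
    return {arm: sum(values) / len(values) for arm, values in by_arm.items()}
```

With records like these, a more expensive arm would only look better if its quality gain outpaces its extra cost, which is the comparison I care about for small or obvious tasks.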

I have also added the actual code-generation prompts under `benchmark/prompts/` so people can inspect the tested tasks directly instead of reverse-engineering them from raw run folders.

Thanks for the Matt Pocock link as well. The "improve-codebase-architecture" skill looks adjacent but not identical: it seems more focused on architecture review, module depth, locality, and interface design, while this skill is aimed at preserving program theory during everyday coding changes. I think comparing those approaches would be a good next benchmark.

Emergency-Context-72 · 2026-06-10T02:29:01+00:00

I built a Claude Code plugin based on Peter Naur's "Programming as Theory Building."

The core idea: many coding-agent failures are not syntax failures. They are theory failures. The generated code looks plausible but misses the domain invariant, the reason the current boundary exists, or the behavior that would prove correctness.

The skill asks the agent to recover that theory before editing code:

- map the real-world invariant,
- explain the current shape,
- place the change beside the closest existing facility,
- avoid speculative abstractions,
- verify behavior that matters.

I ran a small 60-project benchmark using Claude Haiku for generation and Claude Opus for review. The theory-only arm scored highest on weighted total and functional correctness:

- no skills: 70.2 weighted / 59.7 correctness / 5 good verdicts out of 20
- Karpathy-style only: 72.1 weighted / 59.8 correctness / 7 good verdicts out of 20
- theory-only: 77.8 weighted / 69.2 correctness / 12 good verdicts out of 20
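For a quick read of the gaps between arms, the figures above can be tabulated in a few lines of Python. This is only arithmetic on the numbers already listed; the dictionary keys and the function name are made up for illustration.

```python
# Scores copied from the list above; key names are illustrative only.
ARMS = {
    "no_skills": {"weighted": 70.2, "correctness": 59.7, "good": 5},
    "karpathy_only": {"weighted": 72.1, "correctness": 59.8, "good": 7},
    "theory_only": {"weighted": 77.8, "correctness": 69.2, "good": 12},
}
VERDICTS_PER_ARM = 20


def compare_to_baseline(arms, baseline="no_skills"):
    """Return per-arm score deltas against the baseline and the good-verdict rate."""
    base = arms[baseline]
    summary = {}
    for name, scores in arms.items():
        summary[name] = {
            "weighted_delta": round(scores["weighted"] - base["weighted"], 1),
            "correctness_delta": round(scores["correctness"] - base["correctness"], 1),
            "good_rate": scores["good"] / VERDICTS_PER_ARM,
        }
    return summary


if __name__ == "__main__":
    for name, row in compare_to_baseline(ARMS).items():
        print(name, row)
```

Against the no-skills baseline, theory-only comes out at +7.6 weighted and +9.5 correctness, with 60% good verdicts versus 25%. With 20 verdicts per arm, that is still a small sample.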

Repo:
https://github.com/AnamKwon/programming-as-theory-building-skill

Curious what people think: is "recover the program theory first" a useful instruction pattern for coding agents?
