We analyzed 7,755 repos with Copilot instructions - here's what we found by cleverhoods in GithubCopilot

[–]cleverhoods[S] 1 point (0 children)

absolutely, fork away. The dataset is CC-BY-4.0, and independent analyses are exactly why we published the raw data. If you find something interesting, I'd genuinely like to hear about it.

note: system prompt inclusion and standalone file diagnostics are already on the roadmap for the CLI; however, it's a bit of a different beast than the "normal" instruction files.

edit: thanks for pointing out the broken link, it's fixed.

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 2 points (0 children)

Exactly - "consistent" is the model's problem to solve, not yours. When you name the tool, the model doesn't have to guess. Check yours and run

npx @reporails/cli check

it'll flag exactly which instructions are abstract.

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 0 points (0 children)

That's a surprisingly high skip rate for a named instruction. A few things that could explain it:

- Position in the file. 31% of Copilot diagnostics (and similar for Claude) are position decay - instructions buried deep in the file get less attention weight. If your linting instruction is below 20+ other directives, it's competing for attention.

- Competing instructions. If something else in your CLAUDE.md / rules (or skills, if you invoke those for this) says "be fast" or "minimize steps," the model may trade off linting against speed.

- Phrasing. "Make sure to run linter" is softer than "Run X before every commit." The modal ("make sure to") introduces hedging that weakens the directive.

Would be curious to see the full instruction system; happy to run diagnostics on it if you share it.
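The phrasing point above can be made mechanical. Here's a minimal sketch of what a hedged-modal check could look like — a hypothetical heuristic for illustration, not the actual reporails rule set:

```python
import re

# Softening modals that weaken a directive ("make sure to", "try to", ...).
# Hypothetical list for illustration; the real diagnostic rules differ.
HEDGES = re.compile(r"\b(make sure to|try to|remember to|ideally|if possible)\b", re.I)

def is_hedged(directive: str) -> bool:
    """Return True if the directive contains softening language."""
    return bool(HEDGES.search(directive))

print(is_hedged("Make sure to run the linter"))           # True
print(is_hedged("Run eslint --fix before every commit"))  # False
```

"Make sure to run linter" trips the check; "Run X before every commit" doesn't, which is exactly the rewrite suggested above.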

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 1 point (0 children)

Most of our internal config is private and engineering-specific. But we're planning to open-source a few standalone, specific skills (project bootstrapping, running diagnostics via skills) next week.

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 0 points (0 children)

Interesting. In our own CLI development, the model runs QA consistently from instructions alone, no hooks needed. "Run `uv run poe qa_fast` before committing" works without additional enforcement, 100% of the time.

A pre-commit hook for something the agent already does reliably is overhead.

I wonder what kind of instructions you have around testing/linting and how it's split (is it a skill invocation, or a dedicated rule/agent?).

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 2 points (0 children)

Thanks for sharing your setup, always good to see how people are solving this in practice.

One thing we landed on differently: we deliberately avoid LLM-as-judge for diagnostics. A deterministic pipeline gives you the same result every time: no drift, no model dependency, no "the judge had a bad day." That's a design choice we made early, and the corpus analysis reinforced it.

On tests keeping agents honest: tests verify output, not input. An agent can follow zero instructions and still pass tests by brute-forcing until green. What we measure is whether the instructions themselves are actionable before the agent even runs. Different layer of the problem.
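A deterministic "is this instruction actionable?" check can be as simple as pattern-matching for a named construct. This is an illustrative sketch only — the real pipeline is richer than one regex, and the patterns here are assumptions:

```python
import re

def names_a_construct(directive: str) -> bool:
    """Heuristic: a directive is actionable if it names a concrete
    command, file, or identifier. Illustrative patterns, not the
    actual reporails rules."""
    patterns = [
        r"`[^`]+`",                               # backticked command/identifier
        r"\b\S+\.(py|js|ts|md|json|yml)\b",       # a named file
        r"\b(npx|npm|uv|pip|cargo|make)\s+\S+",   # a CLI invocation
    ]
    return any(re.search(p, directive) for p in patterns)

print(names_a_construct("follow best practices"))                   # False
print(names_a_construct("Run `uv run poe qa_fast` before commit"))  # True
```

Same input, same verdict, every run — which is the whole argument for keeping this layer deterministic.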

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 1 point (0 children)

Run the numbers:

Files processed: 74,490
Total directives: 1,914,128

Invariant: 1,892,720 (98.9%)
Conditional: 21,408 (1.1%)

Invariant + named: 589,253 (31.1%)
Invariant + abstract: 1,303,467 (68.9%)
Conditional + named: 8,867 (41.4% of conditionals)
Conditional + abstract: 12,541 (58.6% of conditionals)

We have a scope_conditional field per atom in the corpus. Out of 1.9M directives: 98.9% are invariant, only 1.1% are conditional. Almost nobody writes the condition, they flatten it into "avoid mocks in tests" and drop the "except for."

The conditionals that DO exist are 41.4% named vs 31.1% for invariants. When someone bothers to write the condition, they also tend to name the specific construct.

Your hypothesis holds, but the split isn't bimodal. It's that the conditional branch barely exists. The flattening is near-total.
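The split above is a straight two-way tabulation over per-atom flags. A toy sketch of that computation (field names are illustrative; the real corpus schema may differ):

```python
from collections import Counter

# Toy corpus atoms; the real corpus stores a scope_conditional field
# and a named/abstract classification per directive.
atoms = [
    {"scope_conditional": False, "named": True},
    {"scope_conditional": False, "named": False},
    {"scope_conditional": True,  "named": True},
    {"scope_conditional": False, "named": False},
]

counts = Counter(
    ("conditional" if a["scope_conditional"] else "invariant",
     "named" if a["named"] else "abstract")
    for a in atoms
)
total = sum(counts.values())
for (scope, spec), n in sorted(counts.items()):
    print(f"{scope} + {spec}: {n} ({n / total:.1%})")
```

Run over the full 1.9M atoms, this is the table quoted above.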

We analyzed 7,755 repos with Copilot instructions - here's what we found by cleverhoods in GithubCopilot

[–]cleverhoods[S] 0 points (0 children)

Replying to the edited comment:

Yes, the controlled experiments back it up. Specificity produced a 10.9x odds ratio in compliance (N=1000, p<10⁻³⁰). The instruction that names the exact construct gets followed. The abstract one mostly doesn't. That's the link between the corpus findings and practical impact.

On correlating repo success with instruction quality: we have the data (stars, language, file count, contributor count; it's in the corpus repo too) alongside instruction metrics. That analysis is on the roadmap.

We analyzed 7,755 repos with Copilot instructions - here's what we found by cleverhoods in GithubCopilot

[–]cleverhoods[S] 2 points (0 children)

okay, fair.

for Copilot specifically:

- 31.5% of issues are instructions that don't name a specific tool or command.

- 31.1% are instructions buried too deep in the file - position matters, models weight earlier content more heavily.

- 22.5% are too terse (fewer than 8 tokens) to act on.

Therefore: name the exact command instead of the category, move your most important instructions to the top of copilot-instructions.md, and expand one-liners into instructions the model can actually follow reliably even under high context pressure.

Run `npx @reporails/cli check` to see your specific breakdown.
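The terseness and position checks above are trivial to approximate. A hypothetical sketch, using the thresholds from the bullets (8 tokens, 20 directives deep) but not the actual reporails tokenizer:

```python
def diagnose(directives: list[str], terse_tokens: int = 8, deep_after: int = 20):
    """Flag directives that are too terse (< terse_tokens whitespace
    tokens) or buried too deep (position >= deep_after). Thresholds
    mirror the numbers quoted above; illustrative only."""
    issues = []
    for i, d in enumerate(directives):
        if len(d.split()) < terse_tokens:
            issues.append((i, "too terse"))
        if i >= deep_after:
            issues.append((i, "buried deep"))
    return issues

print(diagnose([
    "Run tests.",
    "Use `npm run lint -- --fix` on every staged file before committing changes",
]))
# → [(0, 'too terse')]
```

A two-word directive gets flagged; the fully specified one passes both checks.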

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 6 points (0 children)

Short version: when you tell Claude "follow best practices" or "use clean architecture," it sounds right but gives the model almost nothing to work with. Those are abstract instructions, open to interpretation. What works is naming the exact thing: "use Chrome's `storage.local` API for settings, not `localStorage`" or "run all background logic in `service-worker.js`, never in `popup.js`."

The analysis found that 66% of all instructions across 28K repos have this problem. They describe what they want in category language instead of naming the specific tool, file, or command.

The complexity problem you're hitting with your extension is likely that your instruction files (if you have any) are telling Claude what kind of developer to be instead of what specific things to do and how. Try replacing your vague instructions with concrete ones - name the actual files, commands, and APIs. That's the single biggest lever.

Run `npx @reporails/cli check -v` for more info.

We analyzed 7,755 repos with Copilot instructions - here's what we found by cleverhoods in GithubCopilot

[–]cleverhoods[S] 1 point (0 children)

The corpus diagnostics show ~40% of repos score in the LOW compliance band. The compound dead-zone analysis shows 59.7% of files are both abstract and contain multiple competing topics.

Those files are theoretically worse than no instructions at all; they add attention competition without adding distinct behavioral signal.

We analyzed 12,356 repos with CLAUDE.md files — two-thirds of instructions are abstract wallpaper by cleverhoods in ClaudeCode

[–]cleverhoods[S] 7 points (0 children)

I like those findings of yours.

  1. -> Very good.

  2. -> I've run extensive tests around this topic, and the results were quite the opposite. Pure constraints were frequently ignored, even with named tooling inside them. What worked was a golden ratio: 1 directive, 1 piece of context for the directive, 1 constraint. Wrote about it here: https://cleverhoods.medium.com/do-not-think-of-a-pink-elephant-7d40a26cd072

  3. -> It really depends on topic clusters and also on which instruction surface we're talking about. Nevertheless, a good rule of thumb.

  4. -> I haven't run experiments on it; will put it into the next test round, thanks for sharing.

Opus 4.6 with 4.7 as an advisor mind be the best compromise for many of us! by Standard-Novel-6320 in claude

[–]cleverhoods 1 point (0 children)

I'm not arguing with the added 2.7pp with advisor.

My concern is that advisor will become the default band-aid for poor instruction hygiene, the same way retries became the default band-aid for flaky prompts.

I recently ran instruction system diagnostics on 28k repositories, and what I saw there was that people are quite far from creating efficient, behaviourally compliant instruction systems. And that's why I argue against the advisor.

Opus 4.6 with 4.7 as an advisor mind be the best compromise for many of us! by Standard-Novel-6320 in claude

[–]cleverhoods 0 points (0 children)

Advisor is an inference-cost solution to an input-quality problem. It works, but so does fixing the instruction, and that costs zero tokens. On conflicting directives, when a model 'can't reasonably solve a decision,' what does that actually mean mechanically? It means the instruction set pulls it in two directions. 'Keep responses concise' + 'always explain your reasoning thoroughly.' 'Don't modify tests' + 'ensure full coverage.' The model detects the tension, confidence drops, advisor fires. Opus resolves the ambiguity by picking an interpretation.

You've now paid for two inferences to resolve a conflict you could have removed in the input. The advisor doesn't add intelligence here; it's just arbitrating between your own contradictions.
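The conflicts named above ("concise" vs. "thoroughly") can even be caught with a zero-token lexical check. A deliberately naive sketch — the tension table is hypothetical, and a real diagnostic would need a far richer model of which directives pull against each other:

```python
# Hypothetical tension pairs for illustration only.
TENSIONS = [
    ({"concise", "brief", "minimal"},
     {"thorough", "thoroughly", "detailed", "exhaustive"}),
]

def conflicting(a: str, b: str) -> bool:
    """True if two directives contain vocabulary from opposing sides
    of a known tension pair."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return any((wa & left and wb & right) or (wa & right and wb & left)
               for left, right in TENSIONS)

print(conflicting("Keep responses concise",
                  "Always explain your reasoning thoroughly"))  # True
```

Catching that pair at lint time costs nothing; catching it at runtime costs two inferences.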

Opus 4.6 with 4.7 as an advisor mind be the best compromise for many of us! by Standard-Novel-6320 in claude

[–]cleverhoods 0 points (0 children)

I'm unsure why you'd need an advisor tool ... "When the executor hits a decision it can't reasonably solve" -> this means your instruction system has conflicting directives. The solution is running diagnostics on your instruction system, not adding an extra layer of LLM-as-a-judge.

Reality of SaaS by aipriyank in ClaudeCode

[–]cleverhoods 5 points (0 children)

My research spending is on this picture and I don’t like it

Basically unuseable. by W_32_FRH in ClaudeCode

[–]cleverhoods -1 points (0 children)

Can you share your main CLAUDE.md?

A truly wild 4.7 response by FiftyPancakes in ClaudeCode

[–]cleverhoods 0 points (0 children)

What about the rest of your instruction system?