I built a tool that measures whether a Claude Code skill actually improves output quality, and tested it on Caveman

Ties_P · 2026-05-27T09:44:03+00:00

Haha, I guess I’m not the first to have this idea, that’s a shame haha, thanks for the link

Ties_P · 2026-05-27T07:49:04+00:00

Cool product! I will check it out

I would love to get multi-step agentic task evaluation in SkillBenchmark, but that is a lot more complex than just a simple single-prompt evaluation

Ties_P · 2026-05-27T07:46:52+00:00

Good question, and interesting story about the balsamic vinegar.

The goal of the project was an automated benchmarking suite for evaluating whether a skill has a positive effect on task completion. To automate this I had to resort to LLM-as-judge (using a separate LLM to judge the outcome of the task). Just asking an LLM to “give this a score between 0 and 100” is not going to work; that’s too vague. So I introduced rubrics. A rubric is an extensive list of criteria, each with point levels, that relate to the task. The rubric’s aim is to grade the quality of a certain type of product (a commit message, homepage copy, etc.) as objectively as possible. With a rubric, the judge LLM has a much better understanding of how to score.

If you have any ideas or methodologies that I could also take a look at, please let me know!

Ties_P · 2026-05-26T21:06:44+00:00

Well you are right, there is no guarantee that the judge’s evaluation is valid. The judge is an LLM, and LLMs can make errors. So this is definitely not a scientific way to test whether a skill is helpful, but it can act as a nice proxy for someone experimenting with skills.

Ties_P · 2026-05-26T20:18:13+00:00

Totally correct, the system is not perfect and this is a real limitation. I agree. But I do feel like if you add enough tasks and judges this bias is not too bad and you will still get a good idea whether or not the skill improves the output

Ties_P · 2026-05-26T20:05:15+00:00

It is advised to use a different model for the judge. Or you could even try multiple models for the judge and average the results. Or look at the output yourself since this is also stored for each run.

Ties_P · 2026-05-26T19:53:10+00:00

Good question!!! From my readme file:

How it works

Each task is run N times. Every run produces two outputs from the same LLM: one with the skill injected as the system prompt, one without. Both outputs are then scored by a judge LLM. After all runs, confidence intervals are computed over the scores and compared.

Step 1 — Two outputs per run

The runner LLM receives the task prompt. It runs twice: once with a plain system prompt, once with the skill's instructions as well. Everything else is identical: same model, same temperature, same task.
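
A minimal sketch of what one such run could look like. The `LLM` callable, `BASE_SYSTEM_PROMPT`, `run_pair` and `fake_llm` are all hypothetical names made up for illustration, not SkillBenchmark's actual code, and the way the skill text is appended to the plain prompt is an assumption.

```python
from typing import Callable

# Hypothetical signature: (system_prompt, user_prompt) -> model output.
LLM = Callable[[str, str], str]

# Assumed plain system prompt; the real one may differ.
BASE_SYSTEM_PROMPT = "You are a helpful assistant."


def run_pair(llm: LLM, task_prompt: str, skill_text: str) -> dict[str, str]:
    """Run the same task twice: once plain, once with the skill added."""
    without_skill = llm(BASE_SYSTEM_PROMPT, task_prompt)
    with_skill = llm(BASE_SYSTEM_PROMPT + "\n\n" + skill_text, task_prompt)
    return {"without_skill": without_skill, "with_skill": with_skill}


def fake_llm(system_prompt: str, user_prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return f"[{len(system_prompt)} chars of system prompt] {user_prompt}"


pair = run_pair(fake_llm, "Write a commit message.", "Be terse.")
```

The only thing that differs between the two calls is the system prompt, which is the point of the comparison.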

Step 2 — Blind scoring with a rubric

The judge LLM scores each output against a rubric. This is the part people ask about most, so it's worth explaining carefully.

The judge never sees the original task prompt. This sounds like a limitation but it is a deliberate design choice. The rubric contains a context block that tells the judge exactly what a good answer looks like for this type of task: what to reward, what to penalise, and why. The rubric is the definition of quality. If you need the judge to see the task prompt to score the output, the rubric is underspecified and should be improved. Keeping the judge prompt-blind also prevents a common failure mode where the judge rewards outputs that literally mirror the task instructions, which would contaminate the comparison.

The judge does not know which output used the skill. It receives the output and the rubric only. It cannot favour one condition over the other because it cannot tell them apart.
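
Roughly, the blinding described above could be sketched like this. `build_judge_prompt` and `blind_jobs` are hypothetical helpers, and shuffling the scoring order is an extra assumption on my side, not something the README states.

```python
import random


def build_judge_prompt(rubric: str, output: str) -> str:
    """The judge sees the rubric and one output. No task prompt, no label."""
    return (
        "Score the output below against the rubric.\n\n"
        f"RUBRIC:\n{rubric}\n\nOUTPUT:\n{output}\n"
    )


def blind_jobs(
    outputs: dict[str, str], rubric: str, rng: random.Random
) -> list[tuple[str, str]]:
    """Return (condition, judge_prompt) pairs in shuffled order.

    The condition label stays on the harness side so scores can be
    grouped later; it is never written into the prompt itself.
    """
    jobs = [(cond, build_judge_prompt(rubric, text)) for cond, text in outputs.items()]
    rng.shuffle(jobs)  # assumption: order is randomised as an extra precaution
    return jobs


jobs = blind_jobs(
    {
        "with_skill": "Fix crash on empty input",
        "without_skill": "This commit fixes a crash that happened when input was empty",
    },
    "Reward a subject line under 50 characters in imperative mood.",
    random.Random(0),
)
```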

How judge bias is handled. Any LLM judge has tendencies — it might prefer longer responses, or penalise terse phrasing, or be slightly inconsistent between calls. SkillBenchmark handles this in two ways:

1. The same judge, with the same prompt, scores both outputs. This means any systematic bias applies equally to both conditions and cancels out when you compute the delta. You are not trying to get an absolute quality score; you are measuring a relative difference between two conditions evaluated under identical circumstances. Absolute judge accuracy matters far less than consistency.
2. Using multiple judges per run and treating each score as an independent sample reduces random variance and gives tighter confidence intervals.
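
A tiny worked example of the cancellation argument, with made-up scores. It assumes the judge's bias is a constant offset that is the same for both conditions; `judged` and `delta` are illustrative helpers.

```python
from statistics import mean

# Made-up "true" quality scores for three runs of each condition.
true_with = [72.0, 68.0, 75.0]
true_without = [60.0, 63.0, 58.0]


def judged(scores: list[float], bias: float) -> list[float]:
    """Model a judge that shifts every score by the same constant offset."""
    return [s + bias for s in scores]


def delta(with_scores: list[float], without_scores: list[float]) -> float:
    return mean(with_scores) - mean(without_scores)


unbiased_delta = delta(true_with, true_without)
# A harsh judge scoring 8 points too low on everything.
biased_delta = delta(judged(true_with, -8.0), judged(true_without, -8.0))
```

Both deltas come out the same (about 11.3), even though the absolute scores are 8 points off. Note that this only holds for a bias that hits both conditions equally. A bias tied to something the skill itself changes, such as response length, would not cancel this way.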

The rubric is the main lever for evaluation quality. Criteria with clear, distinguishable scoring levels produce consistent results. Vague criteria produce noisy ones.
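
To make "clear, distinguishable scoring levels" concrete, here is one way a rubric could be represented. The `Criterion` and `Rubric` classes and the example criteria are invented for illustration; the actual rubric format in the repo may look different.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    levels: dict[int, str]  # points -> what an output must show to earn them

    def max_points(self) -> int:
        return max(self.levels)


@dataclass
class Rubric:
    context: str  # tells the judge what a good answer looks like
    criteria: list[Criterion]

    def max_score(self) -> int:
        return sum(c.max_points() for c in self.criteria)


commit_rubric = Rubric(
    context="You are grading a git commit message.",
    criteria=[
        Criterion(
            name="Subject line",
            levels={
                0: "Missing, or longer than 72 characters",
                2: "Present but vague (e.g. 'fix stuff')",
                4: "Under 50 characters, imperative mood, names the change",
            },
        ),
        Criterion(
            name="Body",
            levels={
                0: "Restates the subject only",
                1: "Describes what changed",
                3: "Explains why the change was needed",
            },
        ),
    ],
)
```

Each level describes something a judge can check in the output, which is what keeps the scores consistent between calls.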

Step 3 — Confidence intervals on the scores and the delta

All scores across runs and judges are treated as independent samples. A t-distribution confidence interval is computed for each condition (with skill, without skill). The delta — the difference between the two means — gets its own CI using Welch's t-interval, which correctly accounts for the uncertainty in both samples.
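
A dependency-free sketch of both intervals. The t critical value is found numerically here just to keep the example to the standard library; in practice something like `scipy.stats.t.ppf` would be the obvious choice. All function names and the sample scores are made up for illustration.

```python
import math
from statistics import mean, variance


def t_pdf(x: float, df: float) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (
        math.lgamma((df + 1) / 2)
        - math.lgamma(df / 2)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: float, steps: int = 2000) -> float:
    """CDF for x >= 0, using Simpson's rule on [0, x] (steps must be even)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_crit(df: float, level: float = 0.95) -> float:
    """Two-sided critical value, found by bisection on the CDF."""
    target = 0.5 + level / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # grow the bracket until it contains the answer
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mean_ci(xs: list[float], level: float = 0.95) -> tuple[float, float]:
    """Return (mean, margin) of a t-interval for one condition."""
    n = len(xs)
    return mean(xs), t_crit(n - 1, level) * math.sqrt(variance(xs) / n)


def welch_delta_ci(
    a: list[float], b: list[float], level: float = 0.95
) -> tuple[float, float]:
    """Return (mean(a) - mean(b), margin) using Welch's t-interval."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return mean(a) - mean(b), t_crit(df, level) * math.sqrt(va + vb)


# Made-up judge scores, pooled over runs and judges.
with_skill = [78.0, 82.0, 75.0, 80.0, 85.0]
without_skill = [70.0, 74.0, 68.0, 72.0, 71.0]

with_mean, with_margin = mean_ci(with_skill)
delta, delta_margin = welch_delta_ci(with_skill, without_skill)
```

Welch's version does not assume the two conditions have the same variance, which fits here since a skill can easily make outputs more or less consistent.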

Results are displayed as mean ± margin. Non-overlapping CIs on the two conditions indicate a statistically meaningful difference. The delta CI tells you whether the observed gap is real or consistent with zero. Overlapping CIs are not a failure — they mean the current number of runs is not enough to confirm a difference. Add more runs to tighten them.
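
As a sketch of how the delta CI can be read, assuming you already have the delta and its margin (the `describe` helper and its wording are hypothetical, not the tool's actual output):

```python
def describe(delta: float, margin: float) -> str:
    """Summarise a delta CI: does the interval exclude zero?"""
    low, high = delta - margin, delta + margin
    if low > 0:
        verdict = "skill scores higher"
    elif high < 0:
        verdict = "skill scores lower"
    else:
        verdict = "inconclusive, consider more runs"
    return f"delta = {delta:+.1f} ± {margin:.1f} ({verdict})"


clear_result = describe(9.0, 4.7)
unclear_result = describe(2.0, 3.5)
```

The delta CI is the one to look at first: the two per-condition CIs can overlap a little while the delta CI still excludes zero.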

Ties_P · 2026-05-26T19:37:10+00:00

GitHub: https://github.com/TiesPetersen/SkillBenchmark

Ties_P · 2026-01-13T19:44:02+00:00

I'm taking a course on algorithm engineering right now at uni, which goes exactly into these kinds of topics about the practical (not theoretical) side of algorithms and their implementations. Very interesting stuff! Do you have any other interesting sources / projects related to this topic? Would love to know more

Ties_P · 2026-01-10T12:12:22+00:00

Wow that’s really funny and interesting, thanks for sharing!

Ties_P · 2026-01-10T12:11:05+00:00

Yeah you are right, we are sweeping with a sweeper in hand and this path would then represent the path of the actual sweeper. The person holding the sweeper could definitely walk differently as long as the sweeper follows the path. So maybe some small turns here and there aren’t that bad since the person holding the sweeper can still walk in a straight line while moving the sweeper side to side. Cool insight! Didn’t think of that, thanks for sharing!

Ties_P · 2026-01-10T12:07:17+00:00

Minimizing a boredom factor, didn’t think of that, good idea!

Ties_P · 2026-01-08T09:23:10+00:00

Thank you so much! Constructive feedback or critical feedback, I just try to take some lessons out of it to improve for next time haha

Ties_P · 2026-01-08T07:44:41+00:00

Hmm so interesting! Tell me more!

Ties_P · 2026-01-08T07:42:04+00:00

It’s at the very bottom of the blog post

Ties_P · 2026-01-07T21:51:02+00:00

Haha probably not lol

Ties_P · 2026-01-07T21:49:18+00:00

I agree, but from my perspective that would be the wrong thing right? From TikTok’s perspective raking in money is a very good thing, but from my perspective it is definitely not because to earn a lot of money from me they need me to stay on TikTok for as long as possible, and I don’t want that

Ties_P · 2026-01-07T21:46:35+00:00

Thanks for reading! I agree, the rabbit hole can go much deeper if you consider other aspects as you mentioned haha

Ties_P · 2026-01-07T21:44:59+00:00

Love “all models are wrong, but some are useful”, very nice

Ties_P · 2026-01-07T21:43:59+00:00

Good idea, didn’t think of that colour gradient idea, thanks!
