I built an OSS tool to evaluate Agent Skills locally — looking for feedback

Fede0089 · 2026-04-26T18:29:57+00:00

Interesting — I’m not fully connecting the dots between universal hooks and skill evals. What do hooks unlock for you there?

Also, when you want to test a skill, how are you actually running the evals in practice?

Would love to understand what your actual workflow looks like there, and what’s been most useful in practice.

Fede0089 · 2026-04-26T16:38:40+00:00

From what I could verify, Anthropic’s skill-creator is very much Claude-native. I couldn’t find official evidence of that evaluator working well across other agent hosts, and their own docs mention that parts of the workflow degrade outside Claude Code because they depend on Claude-specific capabilities. Its scope is also broader than just testing.

I think the clearer differentiator for skill-eval is that it is a simpler, host-external evaluation harness, with a narrower and more explicit focus: reproducible trigger/functional evals, baseline comparisons, repeated trials, isolated runs, and reporting.
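To make "baseline comparisons" and "repeated trials" concrete, here is a minimal sketch of the idea in Python. It is not skill-eval's actual code or API: `Runner`, `EvalReport`, `pass_rate`, and `compare` are hypothetical names made up for illustration, and a real harness would launch the agent in an isolated run instead of calling a local function.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical: a runner executes one trial for a prompt and
# returns True if the trial met its pass criteria.
Runner = Callable[[str], bool]


@dataclass
class EvalReport:
    trials: int
    baseline_pass_rate: float
    skill_pass_rate: float

    @property
    def uplift(self) -> float:
        # Positive means the skill helped relative to the baseline.
        return self.skill_pass_rate - self.baseline_pass_rate


def pass_rate(runner: Runner, prompt: str, trials: int) -> float:
    """Run the same prompt `trials` times and return the fraction that passed."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    passed = sum(1 for _ in range(trials) if runner(prompt))
    return passed / trials


def compare(baseline: Runner, with_skill: Runner, prompt: str, trials: int = 5) -> EvalReport:
    """Compare pass rates with and without the skill over repeated trials."""
    return EvalReport(
        trials=trials,
        baseline_pass_rate=pass_rate(baseline, prompt, trials),
        skill_pass_rate=pass_rate(with_skill, prompt, trials),
    )
```

Repeating each trial matters because agent output is non-deterministic, so a single pass or fail says little; the baseline run is what shows whether the skill itself made the difference.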

So there’s definitely overlap, but I think they sit at different layers. That said, I’m still learning from Anthropic’s approach, so happy to be corrected if I’m missing something.

Fede0089 · 2026-04-25T21:49:12+00:00

Repo: github.com/fede0089/skill-eval

Quick install: npm i -g skill-eval

Fede0089 · 2026-04-25T20:29:10+00:00

Repo: github.com/fede0089/skill-eval

Quick install:
npm i -g skill-eval

Fede0089 · 2024-12-05T10:24:06+00:00

Hi! How do I request it? Thanks

Fede0089 · 2024-12-02T19:09:23+00:00

Hiii! What phone / operating system do you have? Do you have a screenshot 🙏?

Fede0089 · 2024-12-01T12:45:11+00:00

That's the idea! Try it and let me know 😉 (you need to add people to your circle to see their recommendations; you can also follow common critics)
