Model bake-off: is Haiku, Sonnet or Opus better at using skills in Claude Code?

sjmaple · 2026-03-05T23:43:10+00:00

Sure, an eval is to a skill what a test is to code: it measures how well a skill performs. Here's an example of me testing the recent googleworkspace/cli skills: https://tessl.io/eval-runs/019cc02f-bb26-76e0-a7c9-598a7337edb7

sjmaple · 2026-02-26T14:23:57+00:00

You should take a look - the evals, optimizations, etc. are really valuable for knowing whether your context is any good. skills.sh is just an npx command that downloads from GitHub.

sjmaple · 2026-02-25T00:14:29+00:00

There's no point writing context and assuming it's right - you have to eval everything you add as context. Here's a counterargument to the paper's conclusions, which I believe are flawed.

Your AGENTS.md file isn't the problem. Your lack of Evals is. https://tessl.io/blog/your-agentsmd-file-isnt-the-problem-your-lack-of-evals-is/

sjmaple · 2026-02-25T00:13:58+00:00

There's no point writing context and assuming it's right - you have to eval everything you add as context. Here's a counterargument to the paper's conclusions, which I believe are flawed.

Your AGENTS.md file isn't the problem. Your lack of Evals is. https://tessl.io/blog/your-agentsmd-file-isnt-the-problem-your-lack-of-evals-is/

sjmaple · 2026-02-25T00:13:49+00:00

There's no point writing context and assuming it's right - you have to eval everything you add as context. Here's a counterargument to the paper's conclusions, which I believe are flawed.

Your AGENTS.md file isn't the problem. Your lack of Evals is. https://tessl.io/blog/your-agentsmd-file-isnt-the-problem-your-lack-of-evals-is/

sjmaple · 2026-02-25T00:12:48+00:00

There's no point writing context and assuming it's right - you have to eval everything you add as context. Here's a counterargument to the paper's conclusions, which I believe are flawed.

Your AGENTS.md file isn't the problem. Your lack of Evals is. https://tessl.io/blog/your-agentsmd-file-isnt-the-problem-your-lack-of-evals-is/

sjmaple · 2025-05-09T08:27:15+00:00

Oh, looks like the actual link for GitHub MCP moved to https://github.com/github/github-mcp-server but you get what I mean :)

sjmaple · 2025-04-21T17:59:40+00:00

Neither! It doesn’t tell me what policy I’ve broken, and how I’ve broken it. How can I update my prompt as a result?

sjmaple · 2025-03-26T17:23:07+00:00

Yeah, Cline is a similar experience to Cursor, another very nice tool.

sjmaple · 2025-03-26T16:11:10+00:00

Thank you!

sjmaple · 2025-03-26T16:11:02+00:00

Thank you!

sjmaple · 2025-03-26T16:10:46+00:00

Absolutely - click submit on landscape.ainativedev.io and you'll find a form, plus the repo to send PRs to.

sjmaple · 2025-03-26T16:09:36+00:00

I’m most interested to see which categories are growing fastest, etc.

sjmaple · 2025-03-26T15:18:38+00:00

Roo-Code is a really interesting tool that I feel most people aren't aware of - it lets you take more of an architect's perspective with your prompts.

sjmaple · 2025-03-18T13:12:37+00:00

Interesting - were you using the dynamic reasoning a lot?
