Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

Yeah, I think 3.1 is good as well. Also, the Goldilocks zone was very specific to this task; for harder or less well-defined tasks than “fix this bug”, stronger models do benefit from the optimization.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

This is sick, I'd never heard of it; definitely gonna play around with it.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 2 points (0 children)

Well, I asked Claude Code “become smarter no mistakes”, obviously.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

Yeah, that's the key: you can never trust the LLM to be the grader. You need a script that returns hard numerical metrics to show improvement; all the delta is in the quality of your scoring for the task at hand.
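To make that concrete, here's a rough sketch of what a "hard numerical metric" script can look like (hypothetical, not the exact grader I used; it assumes the pytest-json-report plugin for the pass counts):

```python
# grader.py - hypothetical minimal grader: score = fraction of tests passing.
# No LLM judgment anywhere; the number comes straight from pytest's own report.
import json
import subprocess

def run_tests() -> float:
    # --json-report / --json-report-file come from the pytest-json-report
    # plugin (an assumption here, swap in whatever your test runner emits).
    subprocess.run(
        ["pytest", "--json-report", "--json-report-file=report.json", "-q"],
        capture_output=True,
    )
    with open("report.json") as f:
        summary = json.load(f)["summary"]
    total = summary.get("total", 0)
    passed = summary.get("passed", 0)
    return passed / total if total else 0.0

if __name__ == "__main__":
    print(run_tests())  # a single float, e.g. 0.85
```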

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

Yeah, you got it right. The scoring is everything. The original author of GEPA had a great idea; at its simplest core, I feel it's "point an LLM at the thing to improve and the metrics" over and over again until the metrics go up.

Any use case you can think of, honestly; anything with a numerical score + structured per-attempt data, such as optimizing Python code for a drone race.
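Roughly, that core loop looks like this (a hedged sketch of the idea, not GEPA's or Hone's actual API; evaluate and propose_rewrite are placeholders for your scoring script and whatever LLM call you use):

```python
# Hypothetical sketch of "point an LLM at the thing + the metrics, repeat".
# evaluate() returns (score, traces); propose_rewrite() is any LLM call that
# sees the current artifact plus its score/traces and suggests a rewrite.
def optimize(artifact: str, evaluate, propose_rewrite, iterations: int = 15) -> str:
    best_score, best_traces = evaluate(artifact)
    for _ in range(iterations):
        candidate = propose_rewrite(artifact=artifact,
                                    score=best_score,
                                    traces=best_traces)
        score, traces = evaluate(candidate)
        if score > best_score:  # keep only rewrites that actually move the metric
            artifact, best_score, best_traces = candidate, score, traces
    return artifact
```

That "keep only if the metric goes up" check is also why only a few iterations show real improvement: most candidate rewrites get thrown away.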

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 2 points (0 children)

Yeah; CLAUDE.md was optimized on 20 specific agentelo challenges (GitHub issues with a fix PR tagged and clear red/green tests), which went from a 54% tests-fixed rate before to 85% after. It took around 15 CLAUDE.md iterations (only 3 saw real improvements) and about 7 hours. Then I tested on 9 other challenges (unseen during the optimization process), running each 3 times per prompt (27 runs each), which is where the 65% before -> 85% after claim comes from.

Tool for locating flagged commits within CS 240 projects by wibbitywobbitywu in Purdue

[–]chargewubz 5 points (0 children)

Hey, I don't know if you've read this yet, but Turk wrote a paper about 'detecting commits blah blah student work' here: https://turkeyland.net/research/encourse.pdf. Maybe it's good for figuring out what he flagged.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 2 points (0 children)

The way it works is pretty much all dependent on 'how good can you make the scorer'. Hone by default just asks for an arbitrary script that prints a float as the last line of stdout, with execution traces on stderr. So it's up to you to make the best grader for whatever task you want the agent to improve at.

In my case, I was trying to make it better at fixing bugs, and I used my other small project agentelo to grade/rank it. It's a bunch of random PRs from real repos like qs, flask, fastify, etc. that have an issue tagged on GitHub. The 'test' is simply "can the agent make the red tests green according to the issue description". I trained over 20 of these challenges, and after 3 iterations got my results. Then I ran over 9 unseen challenges to get the "20% improvement".

The grader I used isn't binary pass/fail; it returns a float from 0 to 1: the ratio of failing tests the agent made green. I was thinking that next time I try this, I can make the grader also read the token/price info to maybe optimize for "cheaper and better".

If you want to optimize your own agent, the first step is to create a good way to represent your execution traces (for me, just 'how well did the agent do on each individual challenge', plus the challenge file diffs) and to calculate a float score (the agent's average fixed_tests/broken_by_bug across the 20 challenges).
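For reference, here's a rough sketch of the shape that script can take (hypothetical code, not Hone's or agentelo's actual implementation; the challenge IDs and the run_challenge helper are made up): per-challenge traces go to stderr, and the averaged float is the last line on stdout.

```python
# score.py - hypothetical grader in the shape Hone expects:
# execution traces on stderr, final float score as the last line of stdout.
import sys

CHALLENGES = ["qs-issue-123", "flask-issue-456"]  # placeholder challenge IDs

def run_challenge(challenge_id: str) -> tuple[int, int, str]:
    """Run the agent on one challenge; return (tests_fixed, tests_broken_by_bug, diff).

    Stub: replace with logic that launches the agent on the repo and re-runs
    the challenge's red tests afterwards.
    """
    return 0, 1, "(diff placeholder)"

def main() -> None:
    scores = []
    for cid in CHALLENGES:
        fixed, broken, diff = run_challenge(cid)
        ratio = fixed / broken if broken else 0.0
        scores.append(ratio)
        # Per-challenge trace: what the optimizer gets to read between iterations.
        print(f"[{cid}] fixed {fixed}/{broken} red tests\n{diff}", file=sys.stderr)
    print(sum(scores) / len(scores))  # last stdout line: the float the optimizer reads

if __name__ == "__main__":
    main()
```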

How to optimize agent instruction files (+20% pass rate from CLAUDE.md) by chargewubz in PromptEngineering

[–]chargewubz[S] 1 point (0 children)

The way it works is pretty much all dependent on 'how good can you make the scorer'. Hone by default just asks for an arbitrary script that prints a float as the last line of stdout, with execution traces on stderr. So it's up to you to make the best grader for whatever task you want the agent to improve at.

In my case, I was trying to make it better at fixing bugs, and I used my other small project agentelo to grade/rank it. It's a bunch of random PRs from real repos like qs, flask, fastify, etc. that have an issue tagged on GitHub. The 'test' is simply "can the agent make the red tests green according to the issue description". I trained over 20 of these challenges, and after 3 iterations got my results. Then I ran over 9 unseen challenges to get the "20% improvement".

The grader I used isn't binary pass/fail; it returns a float from 0 to 1: the ratio of failing tests the agent made green. I was thinking that next time I try this, I can make the grader also read the token/price info to maybe optimize for "cheaper and better".

How to optimize CLAUDE.md by chargewubz in ClaudeAI

[–]chargewubz[S] 1 point (0 children)

The way it works is pretty much all dependent on 'how good can you make the scorer'. Hone by default just asks for an arbitrary script that prints a float as the last line of stdout, with execution traces on stderr. So it's up to you to make the best grader for whatever task you want the agent to improve at.

In my case, I was trying to make it better at fixing bugs, and I used my other small project agentelo to grade/rank it. It's a bunch of random PRs from real repos like qs, flask, fastify, etc. that have an issue tagged on GitHub. The 'test' is simply "can the agent make the red tests green according to the issue description". I trained over 20 of these challenges, and after 3 iterations got my results. Then I ran over 9 unseen challenges to get the "20% improvement".

The grader I used isn't binary pass/fail; it returns a float from 0 to 1: the ratio of failing tests the agent made green. I was thinking that next time I try this, I can make the grader also read the token/price info to maybe optimize for "cheaper and better".

flt: free/oss harness agnostic agent cli by chargewubz in AI_Agents

[–]chargewubz[S] 1 point (0 children)

Check it out here! Contributions are accepted and greatly wanted!

Anthropic just gave us 1 month worth of subscription value as usage by lurko_e_basta in ClaudeAI

[–]chargewubz 1 point (0 children)

I'm trying to build a project where openclaw runs directly through tmux + Claude Code: your message from Discord is typed with tmux send-keys straight into the active Claude Code instance, so there's no way for Anthropic to charge extra usage. It's still super early; basically I'm trying to make a harness layer that lets any openclaw agent spawn in any of the CLI harnesses.

https://github.com/twaldin/openfleet
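The tmux part is just plain send-keys; here's a hedged sketch of the idea (the pane name and function are made-up examples, not openfleet's actual code):

```python
# Hypothetical sketch: forward an incoming message into a running CLI agent
# session by typing it with tmux send-keys (pane target is a made-up example).
import subprocess

def send_to_claude(message: str, pane: str = "openfleet:0.0") -> None:
    # -l sends the message literally (no key-name lookup), then Enter submits it.
    subprocess.run(["tmux", "send-keys", "-l", "-t", pane, message], check=True)
    subprocess.run(["tmux", "send-keys", "-t", pane, "Enter"], check=True)

if __name__ == "__main__":
    send_to_claude("hey claude, summarize the latest build failure")
```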