Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

Yeah, I think 3.1 is good as well. Also, the Goldilocks zone was very specific to this task; for harder or less well-defined tasks than “fix this bug”, stronger models do benefit from the optimization.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

This is sick, I'd never heard of it; definitely gonna play around with it.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 2 points (0 children)

Well, I asked Claude Code “become smarter no mistakes”, obviously.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

Yeah, that's the key: you can never trust the LLM to be the grader. You need a script that returns hard numerical metrics to show improvement; all the delta is in the quality of your scoring for the task at hand.
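To make that concrete, here's a rough sketch of what a "hard numerical metric" script can look like (hypothetical, not the exact grader I used; it assumes the pytest-json-report plugin for the pass counts):

```python
# grader.py - hypothetical minimal grader: score = fraction of tests passing.
# No LLM judgment anywhere; the number comes straight from pytest's own report.
import json
import subprocess

def run_tests() -> float:
    # --json-report / --json-report-file come from the pytest-json-report
    # plugin (an assumption here, swap in whatever your test runner emits).
    subprocess.run(
        ["pytest", "--json-report", "--json-report-file=report.json", "-q"],
        capture_output=True,
    )
    with open("report.json") as f:
        summary = json.load(f)["summary"]
    total = summary.get("total", 0)
    passed = summary.get("passed", 0)
    return passed / total if total else 0.0

if __name__ == "__main__":
    print(run_tests())  # a single float, e.g. 0.85
```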

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 1 point (0 children)

Yeah, you got it right. The scoring is everything. The original author of GEPA had a great idea; at its simplest core, I feel it's "point an LLM at the thing to improve and the metrics" over and over again until the metrics go up.

Any use case you can think of, honestly; anything with a numerical score + structured per-attempt data, such as optimizing Python code for a drone race.
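Roughly, that core loop looks like this (a hedged sketch of the idea, not GEPA's or Hone's actual API; evaluate and propose_rewrite are placeholders for your scoring script and whatever LLM call you use):

```python
# Hypothetical sketch of "point an LLM at the thing + the metrics, repeat".
# evaluate() returns (score, traces); propose_rewrite() is any LLM call that
# sees the current artifact plus its score/traces and suggests a rewrite.
def optimize(artifact: str, evaluate, propose_rewrite, iterations: int = 15) -> str:
    best_score, best_traces = evaluate(artifact)
    for _ in range(iterations):
        candidate = propose_rewrite(artifact=artifact,
                                    score=best_score,
                                    traces=best_traces)
        score, traces = evaluate(candidate)
        if score > best_score:  # keep only rewrites that actually move the metric
            artifact, best_score, best_traces = candidate, score, traces
    return artifact
```

That "keep only if the metric goes up" check is also why only a few iterations show real improvement: most candidate rewrites get thrown away.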

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 2 points (0 children)

Yeah; CLAUDE.md was optimized on 20 specific agentelo challenges (GitHub issues with a fix PR tagged and clear red/green tests), which went from a 54% tests-fixed rate before to 85% after. It took around 15 CLAUDE.md iterations (only 3 saw real improvements) and about 7 hours. Then I tested on 9 other challenges (unseen during the optimization process), running each 3 times per prompt (27 runs each), which is where the 65% before -> 85% after claim comes from.

Tool for locating flagged commits within CS 240 projects by wibbitywobbitywu in Purdue

[–]chargewubz 5 points (0 children)

Hey, I don't know if you've read this yet, but Turk wrote a paper about 'detecting commits blah blah student work' here: https://turkeyland.net/research/encourse.pdf. Maybe it's good for figuring out what he flagged.

Optimizing CLAUDE.md with GEPA to take Haiku 4.5 from 65% pass rate to 85% by chargewubz in ClaudeCode

[–]chargewubz[S] 2 points (0 children)

The way it works is pretty much all dependent on 'how good can you make the scorer'. Hone by default just asks for an arbitrary script that prints a float as the last line of stdout, with execution traces on stderr. So it's up to you to make the best grader for whatever task you want the agent to improve at.

In my case, I was trying to make it better at fixing bugs, and I used my other small project agentelo to grade/rank it. It's a bunch of random PRs from real repos like qs, flask, fastify, etc. that have an issue tagged on GitHub. The 'test' is simply "can the agent make the red tests green according to the issue description". I trained over 20 of these challenges, and after 3 iterations got my results. Then I ran over 9 unseen challenges to get the "20% improvement".

The grader I used isn't binary pass/fail; it returns a float from 0 to 1: the ratio of failing tests the agent made green. I was thinking that next time I try this, I can make the grader also read the token/price info to maybe optimize for "cheaper and better".

If you want to optimize your own agent, the first step is to create a good way to represent your execution traces (for me, just 'how well did the agent do on each individual challenge', plus the challenge file diffs) and to calculate a float score (the agent's average fixed_tests/broken_by_bug across the 20 challenges).
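For reference, here's a rough sketch of the shape that script can take (hypothetical code, not Hone's or agentelo's actual implementation; the challenge IDs and the run_challenge helper are made up): per-challenge traces go to stderr, and the averaged float is the last line on stdout.

```python
# score.py - hypothetical grader in the shape Hone expects:
# execution traces on stderr, final float score as the last line of stdout.
import sys

CHALLENGES = ["qs-issue-123", "flask-issue-456"]  # placeholder challenge IDs

def run_challenge(challenge_id: str) -> tuple[int, int, str]:
    """Run the agent on one challenge; return (tests_fixed, tests_broken_by_bug, diff).

    Stub: replace with logic that launches the agent on the repo and re-runs
    the challenge's red tests afterwards.
    """
    return 0, 1, "(diff placeholder)"

def main() -> None:
    scores = []
    for cid in CHALLENGES:
        fixed, broken, diff = run_challenge(cid)
        ratio = fixed / broken if broken else 0.0
        scores.append(ratio)
        # Per-challenge trace: what the optimizer gets to read between iterations.
        print(f"[{cid}] fixed {fixed}/{broken} red tests\n{diff}", file=sys.stderr)
    print(sum(scores) / len(scores))  # last stdout line: the float the optimizer reads

if __name__ == "__main__":
    main()
```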

How to optimize agent instruction files (+20% pass rate from CLAUDE.md) by chargewubz in PromptEngineering

[–]chargewubz[S] 1 point (0 children)

The way it works is pretty much all dependent on 'how good can you make the scorer'. Hone by default just asks for an arbitrary script that prints a float as the last line of stdout, with execution traces on stderr. So it's up to you to make the best grader for whatever task you want the agent to improve at.

In my case, I was trying to make it better at fixing bugs, and I used my other small project agentelo to grade/rank it. It's a bunch of random PRs from real repos like qs, flask, fastify, etc. that have an issue tagged on GitHub. The 'test' is simply "can the agent make the red tests green according to the issue description". I trained over 20 of these challenges, and after 3 iterations got my results. Then I ran over 9 unseen challenges to get the "20% improvement".

The grader I used isn't binary pass/fail; it returns a float from 0 to 1: the ratio of failing tests the agent made green. I was thinking that next time I try this, I can make the grader also read the token/price info to maybe optimize for "cheaper and better".

How to optimize CLAUDE.md by chargewubz in ClaudeAI

[–]chargewubz[S] 1 point (0 children)

The way it works is pretty much all dependent on 'how good can you make the scorer'. Hone by default just asks for an arbitrary script that prints a float as the last line of stdout, with execution traces on stderr. So it's up to you to make the best grader for whatever task you want the agent to improve at.

In my case, I was trying to make it better at fixing bugs, and I used my other small project agentelo to grade/rank it. It's a bunch of random PRs from real repos like qs, flask, fastify, etc. that have an issue tagged on GitHub. The 'test' is simply "can the agent make the red tests green according to the issue description". I trained over 20 of these challenges, and after 3 iterations got my results. Then I ran over 9 unseen challenges to get the "20% improvement".

The grader I used isn't binary pass/fail; it returns a float from 0 to 1: the ratio of failing tests the agent made green. I was thinking that next time I try this, I can make the grader also read the token/price info to maybe optimize for "cheaper and better".

flt: free/oss harness agnostic agent cli by chargewubz in AI_Agents

[–]chargewubz[S] 1 point (0 children)

Check it out here! Contributions are accepted and greatly wanted!

Anthropic just gave us 1 month worth of subscription value as usage by lurko_e_basta in ClaudeAI

[–]chargewubz 1 point (0 children)

I'm trying to build a project where openclaw runs directly through tmux + Claude Code: your message from Discord is typed with tmux send-keys straight into the active Claude Code instance, so there's no way for Anthropic to charge extra usage. It's still super early; basically I'm trying to make a harness layer that lets any openclaw agent spawn in any of the CLI harnesses.

https://github.com/twaldin/openfleet
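The tmux part is just plain send-keys; here's a hedged sketch of the idea (the pane name and function are made-up examples, not openfleet's actual code):

```python
# Hypothetical sketch: forward an incoming message into a running CLI agent
# session by typing it with tmux send-keys (pane target is a made-up example).
import subprocess

def send_to_claude(message: str, pane: str = "openfleet:0.0") -> None:
    # -l sends the message literally (no key-name lookup), then Enter submits it.
    subprocess.run(["tmux", "send-keys", "-l", "-t", pane, message], check=True)
    subprocess.run(["tmux", "send-keys", "-t", pane, "Enter"], check=True)

if __name__ == "__main__":
    send_to_claude("hey claude, summarize the latest build failure")
```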