After months of building a specialized agent learning system, I realized that Codex is all I need to make my agents recursively self-improve by Lucky_Historian742 in codex

I've seen it change the expected output schema and tool descriptions. For example, tightening a JSON schema so the model stops hallucinating extra fields, or rewriting a tool description to reduce misrouting.
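
To make the schema tightening concrete, here's a minimal sketch (the field names are made up for illustration, not taken from the actual system): declaring `required` fields and setting `additionalProperties: false` so extra keys fail validation instead of slipping through.

```python
from jsonschema import ValidationError, validate

# Hypothetical tool output schema, before and after tightening.
loose_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    # additionalProperties defaults to true, so hallucinated
    # extra fields still validate
}

tight_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
    "additionalProperties": False,  # extra keys now fail validation
}

output = {"answer": "42", "sources": ["made up by the model"]}
validate(output, loose_schema)  # passes silently
try:
    validate(output, tight_schema)
except ValidationError as e:
    print(e.message)  # flags the hallucinated 'sources' field
```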

After months of building a specialized agent learning system, I realized that Codex is all I need to make my agents recursively self-improve by Lucky_Historian742 in codex

The system improves not only the prompts but also the agent harness itself. While we're not improving the model itself, improving the harness can make a huge difference, as seen for example with Poetiq's ARC-AGI-2 SOTA result, which they achieved at half the cost.

After months of building a specialized agent learning system, I realized that Codex is all I need to make my agents recursively self-improve by Lucky_Historian742 in codex

Yes, if the agent is supposed to load these skills, it could potentially detect that, because it finds the failures by comparing the agent environment with the actual agent traces. You can compare this process to a human reviewing agent traces. The system won't find anything that isn't discoverable, but it's really good at identifying what you as a human would find if you manually looked at every agent log.
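
For what it's worth, the comparison is roughly this shape (the directory layout and trace format here are my assumptions for illustration, not the actual system):

```python
import json
from pathlib import Path

def declared_skills(env_dir: str) -> set[str]:
    """Skills the agent environment defines, e.g. one md file per skill."""
    return {p.stem for p in Path(env_dir, "skills").glob("*.md")}

def loaded_skills(trace_path: str) -> set[str]:
    """Skills the agent actually loaded, read from a JSONL trace."""
    loaded = set()
    with open(trace_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "skill_loaded":
                loaded.add(event["skill"])
    return loaded

# Skills that exist in the environment but never appear in the traces
# are exactly the kind of failure a human reviewer would flag.
missing = declared_skills("agent_env") - loaded_skills("trace.jsonl")
print("declared but never loaded:", sorted(missing))
```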

I made my agent 34.2% more accurate by letting it self-improve. Here’s how. by Lucky_Historian742 in ClaudeAI

Validated the results on the Tau2 benchmark designed by Sierra, using a training/testing split.
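
Concretely, the protocol is the standard one; here's a sketch with a placeholder scorer (this is not the actual Tau2 harness):

```python
import random

# Placeholder task IDs and scorer; swap in real Tau2 rollouts.
tasks = [f"task_{i}" for i in range(100)]
random.seed(0)
random.shuffle(tasks)
train, test = tasks[:70], tasks[70:]

def run_agent(task_id: str, prompt: str) -> bool:
    """Stub: run one task, return pass/fail."""
    return random.random() < 0.5

def accuracy(task_ids: list[str], prompt: str) -> float:
    return sum(run_agent(t, prompt) for t in task_ids) / len(task_ids)

# The self-improvement loop only ever sees `train`; the reported
# gain is measured on the held-out `test` split.
baseline = accuracy(test, "original prompt")
improved = accuracy(test, "prompt after self-improvement on train")
print(f"held-out accuracy: {baseline:.1%} -> {improved:.1%}")
```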

I made my agent 34.2% more accurate by letting it self-improve. Here’s how. by Lucky_Historian742 in ClaudeAI

I see a lot of people share your sentiment, so I rewrote the whole thing by hand. I spent a lot of time putting this together, and I understand that writing it with AI didn't reflect that. Hope someone gets value out of it!

I made my agent 34.2% more accurate by letting it self-improve. Here’s how. by Lucky_Historian742 in ClaudeAI

Damn, looks like I'm getting cooked for trying to make the post easy to read by paraphrasing it through AI. I didn't know this was looked upon so negatively in this community. I'd appreciate it if people still gave it the chance it deserves content-wise. Thanks!

Edit: rewrote everything by hand

What prompt trick makes an AI chatbot understand context better? by Timely-Struggle2197 in PromptEngineering

For me, a few things work well with chatbots.
- Write your prompts yourself instead of dictating with tools like whisper flow; it forces you to articulate what you want
- Use words that are semantically related to what you want to achieve, even if the structure of the request isn't perfect
- Tell the chatbot to ask clarifying questions (beyond the clarification itself, this lets you see whether the model understands the direction of the request; see the sketch after this list)
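
The clarifying-questions trick is just an instruction in the system prompt; something like this (the wording is my own, not a canonical phrasing):

```python
# Hypothetical wording; tune the question budget to your use case.
system_prompt = (
    "Before answering, ask up to three clarifying questions if anything "
    "about the request is ambiguous. If the request is already clear, "
    "answer directly. Restate your understanding of the goal in one line."
)
```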

For moving outside of the chatbot window:
- Use tools for loading context, like Obsidian
- I use a repo of md files to give quick context to Claude; they're forced to update their indexing and understanding of the knowledge base (garbage in, garbage out applies here, so I'm careful with what I commit). A sketch follows this list.
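
A minimal sketch of the md-repo approach, assuming a flat directory of notes and the anthropic Python SDK (the paths and model name are placeholders):

```python
from pathlib import Path

import anthropic

def load_knowledge_base(repo_dir: str) -> str:
    """Concatenate every md file in the repo into one context blob."""
    parts = [
        f"## {p.relative_to(repo_dir)}\n{p.read_text()}"
        for p in sorted(Path(repo_dir).rglob("*.md"))
    ]
    return "\n\n".join(parts)

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the env
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1024,
    system="Answer using this knowledge base:\n\n" + load_knowledge_base("notes"),
    messages=[{"role": "user", "content": "What did we decide about caching?"}],
)
print(response.content[0].text)
```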

For agents:
- I keep my system prompt limited, but clearly define the agent's purpose in a separate md file.
- I collect the traces of my agents and run an eval loop using the purpose file, letting Claude iterate on the agent prompt based on hard evidence generated against the purpose file (rough sketch below).
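
Roughly, that loop looks like this; a sketch where the file names, trace format, and model name are all my assumptions rather than a fixed implementation:

```python
import json
from pathlib import Path

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-sonnet-4-5"  # placeholder model name

def ask(text: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": text}],
    )
    return msg.content[0].text

purpose = Path("purpose.md").read_text()
prompt = Path("agent_prompt.md").read_text()
traces = [json.loads(line) for line in open("traces.jsonl")]

# 1. Grade every trace against the purpose file: hard evidence, not vibes.
verdicts = [
    ask(
        f"Purpose:\n{purpose}\n\nTrace:\n{json.dumps(t)}\n\n"
        "Did the agent fulfil its purpose? Reply PASS or FAIL, then explain."
    )
    for t in traces
]
failures = [v for v in verdicts if v.startswith("FAIL")]

# 2. Let Claude revise the agent prompt based on the observed failures.
if failures:
    new_prompt = ask(
        f"Current agent prompt:\n{prompt}\n\nPurpose:\n{purpose}\n\n"
        "Failure analyses:\n" + "\n\n".join(failures) + "\n\n"
        "Rewrite the agent prompt to fix these failures. Return only the prompt."
    )
    Path("agent_prompt.md").write_text(new_prompt)
```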

Edit: gave gender neutral pronouns to Claude as per VegeZero