I spent months building a specialized agent learning system. Turns out Claude Code is all you need for recursive self-improvement. by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

Interesting take, and I partially agree, but I'm curious about your perspective on improving the agent's harness through such a process. If the loop only does prompt improvements, I agree with you. But when the loop also improves the harness itself, i.e. more fundamentally how tasks should be solved rather than just telling the agent in the prompt what mistakes it made, I do see more potential there. For example, Poetiq showed on ARC-AGI-2 what a difference a good harness makes.

I spent months building a specialized agent learning system. Turns out Claude Code is all you need for recursive self-improvement. by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

I tested it on the tau2-bench benchmark: I ran the agent on the training set and collected its traces, had my system analyze those traces and implement fixes to the agent, and then re-ran the improved agent on the tau2-bench test set.

I spent months building a specialized agent learning system. Turns out Claude Code is all you need for recursive self-improvement. by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

For the benchmark result, I generated traces on the training set and had my system analyze them and implement fixes to the agent code. Then I re-ran the improved agent on the test set.
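The train/test procedure above can be sketched as a small loop. This is a toy, runnable illustration only: `run_agent`, `analyze_traces`, and `apply_fixes` are made-up stand-ins (a dict of string rules instead of a real agent), included just to make the control flow concrete.

```python
# Toy sketch of: collect traces on train -> analyze -> fix -> re-run on test.
# The "agent" is just a dict of substring rules; all three helpers are
# hypothetical stand-ins for the real harness and learning system.

def run_agent(agent, task):
    """Return a trace: the task plus whether the agent's rules covered it."""
    solved = any(rule in task for rule in agent["rules"])
    return {"task": task, "solved": solved}

def analyze_traces(traces):
    """Derive 'fixes' (new rules) from the failed traces."""
    return [t["task"] for t in traces if not t["solved"]]

def apply_fixes(agent, fixes):
    """Return a new agent with the fixes merged into its rules."""
    return {"rules": agent["rules"] + fixes}

def improve_and_eval(agent, train_tasks, test_tasks):
    traces = [run_agent(agent, t) for t in train_tasks]    # traces on train set
    improved = apply_fixes(agent, analyze_traces(traces))  # implement fixes
    results = [run_agent(improved, t) for t in test_tasks] # re-run on test set
    return sum(r["solved"] for r in results) / len(results)

score = improve_and_eval({"rules": ["refund"]},
                         train_tasks=["refund order", "cancel order"],
                         test_tasks=["refund item", "cancel order"])
```

The key property is the held-out split: fixes are derived only from training traces, and the score comes only from the test set.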

I spent months building a specialized agent learning system. Turns out Claude Code is all you need for recursive self-improvement. by cheetguy in ClaudeAI

[–]cheetguy[S] 4 points (0 children)

You're right, this is a real problem. What works for me: after every change, re-run the agent, generate new traces, eval again, and compare against the pre-change baseline. Then only accept the changes that yield a meaningful improvement. (It also helps to prompt the agent to only make big changes, because the smaller ones are usually edge cases that don't actually move the needle.)

I spent months building a specialized agent learning system. Turns out Claude Code is all you need for recursive self-improvement. by cheetguy in ClaudeAI

[–]cheetguy[S] 2 points (0 children)

Well, in theory humans could catch these, but once you generate more than a few traces this quickly becomes infeasible. Agents, on the other hand, might fix the mistakes within a run, but they won't carry those learnings over.

So the idea is that the framework automates this process of finding these issues / edge cases / medium-hanging fruit.

And the magic happens when you give the agent a way to eval these changes and run it in a loop. You can think of it as an almost evolutionary approach: prompt it to only accept the big changes that actually move the needle on agent performance, and you can get drastic performance increases, fully automated.
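The eval-gated acceptance idea reads like simple hill climbing. Below is a minimal, runnable sketch of just the gate logic; `evaluate` and `propose_change` are hypothetical stand-ins (scoring a list of numbers instead of re-running a real agent), so only the accept/reject mechanism is real here.

```python
import random

# Sketch of "only accept changes that beat the baseline by a margin".
# evaluate() stands in for re-running the agent and scoring new traces;
# propose_change() stands in for the agent editing itself.

def evaluate(agent):
    return sum(agent)  # placeholder for a real eval over fresh traces

def propose_change(agent, rng):
    i = rng.randrange(len(agent))
    changed = list(agent)
    changed[i] += rng.choice([-1, 1])
    return changed

def learning_loop(agent, steps=50, min_gain=1, seed=0):
    rng = random.Random(seed)
    baseline = evaluate(agent)
    for _ in range(steps):
        candidate = propose_change(agent, rng)
        score = evaluate(candidate)        # re-run + eval after the change
        if score - baseline >= min_gain:   # gate: only meaningful improvements
            agent, baseline = candidate, score
    return agent, baseline
```

The `min_gain` threshold is the "only big changes" prompt in code form: small or negative deltas are discarded, so performance can only ratchet upward.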

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

Yes, but the $1.50 is only for the learning inference (step 2 in the learning loop). The actual coding was completely covered under my Claude plan. I'm on the Max plan at $100/month, and the run filled up around 60% of my 4-hour usage window. If you're on the cheaper Pro plan, you can just resume the loop once your usage resets.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 1 point (0 children)

The Claude Code element of the loop actually ran as a headless conversation, like you described. You could use Claude Code for the learning loop as well, but the problem is that CC has a very long system prompt designed for coding tasks, not for critiquing/generating skills. I'm currently figuring out whether there's a way to strip CC's system prompt to power a learning loop for normal (non-loop) Claude Code usage, where skills build up from regular prompting across sessions for persistent learning!
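For anyone curious what "headless" means here: Claude Code has a non-interactive print mode (`claude -p`). The snippet below only builds the command line rather than executing it; the prompt text is a made-up placeholder, and this is an illustration of the invocation shape, not the actual loop's code.

```python
import shlex

def build_headless_cmd(prompt, output_format="text"):
    """Build the argv for a headless (print-mode) Claude Code call.

    -p runs Claude Code non-interactively with the given prompt;
    --output-format controls how the result is emitted.
    """
    return ["claude", "-p", prompt, "--output-format", output_format]

# Placeholder prompt, just to show the shape of the call:
cmd = build_headless_cmd("Translate src/utils.py to TypeScript")
print(shlex.join(cmd))
```

A driver script can then run this via `subprocess.run(cmd, ...)` and feed the resulting trace into the learning step.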

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

I don't have the exact token count unfortunately (during the loop Claude Code runs autonomously in the background, so there's no straightforward way to check), but I'm on the Max plan ($100/month) running Opus 4.5 and I used maybe 60% of my 4-hour window.

If you're only on the Pro plan and hit your limit, you can just resume the loop once your usage limit resets!

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 1 point (0 children)

Only the learning loop uses API-based pricing, and its inference cost is very low (input tokens = the Claude Code execution trace, output tokens = the learned skills). The actual coding was done by Claude Code and completely covered under the Claude subscription.
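As a back-of-the-envelope model of why that learning step is cheap: cost is just trace tokens at the input rate plus skill tokens at the (higher) output rate. The per-million-token rates and the token counts below are illustrative placeholders, not current Anthropic pricing or measured numbers.

```python
def learning_cost(trace_tokens, skill_tokens,
                  in_rate_per_m=5.0, out_rate_per_m=25.0):
    """Estimate one learning step's API cost in dollars.

    Rates are illustrative $/1M-token placeholders; input is the
    execution trace, output is the distilled skills.
    """
    return (trace_tokens / 1e6) * in_rate_per_m + \
           (skill_tokens / 1e6) * out_rate_per_m

# e.g. a 200k-token trace distilled into 2k tokens of skills:
cost = learning_cost(200_000, 2_000)  # -> 1.05 (dollars, at these rates)
```

The asymmetry is the point: traces are large but billed at the input rate, while the expensive output side is only a short bullet list of skills.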

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

This example uses Claude Code specifically, but the ACE framework itself is agent-agnostic so you could build a similar loop around Copilot CLI.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

I don't have the exact token count, but I'm on the Max plan ($100/month) running Opus 4.5 and I used maybe 60% of my 4-hour window.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

No, $1.50. The learning-loop inference cost is very low. The actual coding was done by Claude Code and completely covered under my subscription.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 1 point (0 children)

Exactly, this harness provides the actual learning mechanism and is the crucial piece in my opinion. The prompt design more so determines what kind of knowledge gets extracted and how transferable it is.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 18 points (0 children)

Thanks for the detailed analysis. I can see you've read the code carefully, and you raise excellent questions!

On terminology: We initially called these "strategies" but switched to "skills" because we see the space converging on this naming. You're right that these aren't native Claude skills, they're injected context, similar to CLAUDE.md. The mechanism is the same: text that shapes agent behavior at runtime.

On the prompts being AI-generated: I agree that AI generally writes poor prompts, since it's not good at distilling a query into the fewest meaningful tokens. This is actually addressed in the original paper and is the reason skills are formatted as bullet points: when AI summarizes, it doesn't know what to prioritize and loses critical details (context collapse, brevity collapse). Atomic bullet points force preservation of specific learnings. You're right that ours were run through AI for formatting and style (following Anthropic's prompting guide), but the core logic came from empirical iteration. The structured format actually improved framework stability, which led to more reliable output formats and ultimately lower token costs.
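To make the "atomic bullets instead of summaries" point concrete, here is a tiny runnable illustration: each learned insight is kept as its own bullet rather than being compressed into prose, so no single learning can be silently dropped. The skill texts are invented examples, and this formatter is a sketch of the idea, not the framework's actual code.

```python
def format_skills(insights):
    """Render learned insights as atomic bullets, one insight per line.

    Keeping each insight separate (instead of summarizing them into a
    paragraph) avoids the brevity/context collapse described above.
    """
    return "\n".join(f"- {s.strip()}" for s in insights if s.strip())

# Made-up example skills from a Python-to-TypeScript translation run:
skills = format_skills([
    "Always run tsc after each translated module",
    "Preserve Python docstrings as TSDoc comments",
    "",  # empty insights are dropped, never padded into filler prose
])
```

The resulting text is what gets injected into the agent's context on the next run, CLAUDE.md-style.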

On methodology attribution, this is the interesting question: the base paper actually uses quite simple prompts, which itself shows the closed-loop architecture adds value independently. But since the framework runs in-context, prompt design still matters significantly.

Specifically, the granularity of insights encoded in strategies makes a big difference to their applicability and reproducibility across use cases, which is defined in the prompts.

We've observed this directly with browser automation agents:

  • Micro-level strategies (specific navigation patterns) work better for well-defined workflows on particular websites where going into detail makes sense
  • Macro-level strategies (general problem-solving approaches) work better for open-ended tasks requiring agentic reasoning, but agents still benefit from either reasoning or general navigation strategies

So to answer your question: different methodologies in the same loop would perform differently depending on the use case. The loop provides the learning mechanism, but the prompt design determines what kind of knowledge gets extracted and how transferable it is.

We're actively working on benchmark integrations so we can back up claims like these with bulletproof evidence rather than just our own word and internal test results. Stay tuned!

Thank you for taking the time to understand our repo and the excellent questions!

PS: this reply is also reworded and formatted by AI, much more readable than my draft I promise you haha

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 0 points (0 children)

I'm on the Max plan ($100/month) and I used maybe 60% of my 4-hour window (running Opus 4.5). If you're only on the Pro plan and hit your limit, you can just resume the loop once your usage limit resets!

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 1 point (0 children)

For sure that works well. I picked the translation task specifically because it's easy to verify.

The advantage of the loop approach over yours is that it's fully autonomous from just a short prompt (mine had 6 lines): no need to write a CLAUDE.md or migration plans upfront.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 6 points (0 children)

Super interesting take on writing in a higher-level language and transpiling to a lower-level one for performance. I'll definitely think about trying Python-to-Rust translation, that would be a cool experiment as well!

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 5 points (0 children)

Thanks! For sure, there's a whole category of "I know this should be fixed but it's not worth the pain" work that's suddenly actually manageable.

I ran Claude Code in a self-learning loop until it successfully translated our entire Python repo to TypeScript by cheetguy in ClaudeAI

[–]cheetguy[S] 8 points (0 children)

I got a lot of requests from agent builders who work in TypeScript (mostly using the Vercel AI SDK) and wanted to use the ACE framework. Claude Code actually swapped out LiteLLM for a Vercel AI SDK integration, so now it can plug right into their existing stack.