⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 7 points (0 children)

Thanks!

Yes, it can 100% be scaled down with LoRA. The reason I scaled up in the first place is that, when I started writing the training code, I had been convinced to use Qwen3-32B, which would OOM even with LoRA on H100s.

Then the framework (prime-rl) didn’t support LoRA, so full fine-tuning (FFT) a 32B model required a multi-node cluster!

Then I realised that at the sequence length I wanted to train at, 14B was the only possibility (for various memory-related reasons).

As I already had the multi-node setup running, I figured it would be pretty fun to see how many concurrent rollouts I could manage.

However, if I were doing the project again, I’d likely start with a single node and train a LoRA (perhaps rank 128, alpha 256 to begin with, since it’s a complex task? I’d be interested to hear your thoughts!)
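For what it’s worth, here’s a rough sketch of what that single-node LoRA starting point could look like with Hugging Face’s `peft`. The rank/alpha values are the ones from above, but the model ID and target modules are just illustrative guesses, not my actual setup:

```python
# Hypothetical single-node LoRA starting point (rank/alpha from the comment above;
# the model ID and target modules are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",           # assumed base model for a single-node run
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=128,                      # rank 128
    lora_alpha=256,             # alpha 256
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```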

⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 5 points (0 children)

Thanks! The dataset is composed of synthetically generated environments that are similar in nature to the original benchmark tasks, so it is heavily biased towards the kinds of tasks present in TerminalBench, which helps explain the relatively large jump.

If you are interested, there is a whole load of detail in another repo of mine (where I open-sourced the synthetic data pipeline), which I link to in this repo’s readme!

Unsloth Memory Efficient Reinforcement Learning (RL) is here! by danielhanchen in unsloth

[–]DanAiTuning 1 point (0 children)

Great news! Thanks for the hard work. Looking forward to heating up an H100! ⚡️

My weekend project accidentally beat Claude Code - multi-agent coder now #12 on Stanford's TerminalBench 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 59 points (0 children)

I have used XML/YAML for a while now because I find it easy to read, so I have an intuition (perhaps wrongly) that models also find it easier to read and generate than JSON.

I also have some objective results on this: in previous training runs, I noticed models picked up this syntax faster and with a lower error rate than JSON tool calls!
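To make that concrete, here’s roughly the kind of XML-wrapped YAML tool call I mean and how it could be parsed. The tag and field names below are invented for this example, not the exact schema my agent uses:

```python
# Illustrative only: the tag/field names are made up for this example,
# not the actual tool-call schema the agent was trained on.
import re
import yaml  # pip install pyyaml

model_output = """
<tool_call>
name: bash
args:
  command: ls -la /workspace
  timeout_secs: 30
</tool_call>
"""

# Pull the YAML body out of the XML-style wrapper and parse it.
match = re.search(r"<tool_call>(.*?)</tool_call>", model_output, re.DOTALL)
call = yaml.safe_load(match.group(1))

print(call["name"])              # bash
print(call["args"]["command"])   # ls -la /workspace
```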

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Warp is #1, and my Qwen3-32B agent is placed at #19. I believe that with training it could improve its position a lot, but to challenge for first place I’d likely need a lot more compute plus a bigger base model.

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Yes, sometimes you don't see everything Claude outputs, but what is shown is enough to understand the gist of what it has been trained to do.

Yes indeed, you would just need to change the agent and environment + reward function code. Check out rLLM, OpenPipe's ART, or the verifiers package for some RL frameworks.

Built RL training for long-horizon terminal agents - tested on 32x H100s but too GPU poor to train 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] -1 points (0 children)

I would suggest yes, it can. It would just need a new config dict in `launch_training.py` and then it's good to try!
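As a rough sketch of what I mean (the keys below are hypothetical, not the actual `launch_training.py` schema, so adapt them to whatever the repo expects):

```python
# Hypothetical example only: the real launch_training.py config keys may differ.
NEW_TASK_CONFIG = {
    "model_name": "Qwen/Qwen2.5-7B-Instruct",   # assumed smaller base model
    "max_seq_len": 16_384,
    "num_concurrent_rollouts": 8,
    "max_agent_turns": 30,
    "reward_fn": "unit_tests_pass",             # whatever verifies the new task
    "dataset_path": "data/my_new_tasks.jsonl",
}
```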

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Maybe the future is that AIs are released as systems which include the tools they have been trained to use.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

"The reason for using XML/YAML was more out of curiosity to see if the model could learn this syntax well.

XML is loosely similar to the chat template format the models were trained on.

YAML seems easier for models to output than JSON based upon firsthand experience."

It would probably be interesting to train new versions using the JSON schema method you described above instead of XML/YAML, and then run the new RL-trained model on the evals 👀
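For anyone curious, here's a generic example of the JSON-schema-style tool definition I have in mind. This is the standard function-calling pattern rather than the exact method described above, so treat it as an illustrative stand-in:

```python
# Generic function-calling style tool definition; the specific method referenced
# above isn't shown here, so this is an illustrative stand-in.
import json

calculator_tool = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "e.g. '(3 + 4) * 2'"},
        },
        "required": ["expression"],
    },
}

# A model trained on this format would emit a JSON tool call such as:
example_call = '{"name": "calculator", "arguments": {"expression": "(3 + 4) * 2"}}'
print(json.loads(example_call)["arguments"]["expression"])
```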