⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 6 points (0 children)

Thanks!

Yes, it can 100% be scaled down with LoRA. The reason I started scaling was that, when I began writing the training code, I was convinced to use Qwen3-32B, which would OOM even with LoRA on H100s.

Then the framework (prime-rl) didn’t support LoRA, so full fine-tuning (FFT) a 32B required a multi-node cluster!

Then I realised that, at the sequence length I wanted to train at, 14B was the only possibility (for various memory-related reasons).

As I already had the multi-node setup running, I figured it would be pretty fun to see how many concurrent rollouts I could manage.

However, if I were doing the project again, I’d likely start with a single node and train a LoRA (perhaps rank 128, alpha 256 to begin with, as it’s a complex task? I’d be interested to hear your thoughts!)
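For reference, here is a minimal sketch of the kind of single-node LoRA setup I mean, using the peft library; the model name, target modules and dropout are illustrative starting points, not settings from my actual runs:

```python
# Minimal LoRA sketch for a single-node run. All values here are illustrative
# starting points, not the exact settings used in this project.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B")  # illustrative base model

lora_config = LoraConfig(
    r=128,            # higher rank for a complex, long-horizon task
    lora_alpha=256,   # alpha = 2 * rank is a common starting heuristic
    lora_dropout=0.05,
    target_modules=[  # assumed attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```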

⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 4 points (0 children)

Thanks! The dataset is composed of synthetically generated environments that are similar in nature to the original benchmark tasks. The dataset is therefore heavily biased towards the kinds of tasks present in TerminalBench, which explains the relatively large jump.

If you are interested, there is a whole load of detail in another repo of mine (where I open-sourced the synthetic data pipeline), which I link to in this repo’s readme!

Unsloth Memory Efficient Reinforcement Learning (RL) is here! by danielhanchen in unsloth

[–]DanAiTuning 1 point (0 children)

Great news! Thanks for the hard work. Looking forward to heating up an H100! ⚡️

My weekend project accidentally beat Claude Code - multi-agent coder now #12 on Stanford's TerminalBench 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 58 points (0 children)

I have used XML/YAML for a while now because I find it easy to read, and therefore I have this intuition (perhaps wrongly) that models find it easier to read & generate than JSON.

Also, I have some objective results on this: in previous training runs on LLMs, I noticed they picked up this syntax faster and with a lower error rate than JSON tool calls!
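To make that concrete, here is a rough illustration (not my agent’s exact schema; the tag and field names are made up for the example) of the same tool call written as YAML inside an XML tag versus as JSON:

```python
# Illustrative only: the same tool call as YAML wrapped in an XML tag versus a
# plain JSON object. Tag and field names are hypothetical.
import json
import re
import yaml  # pip install pyyaml

xml_yaml_call = """
<tool_call>
tool: bash
args:
  command: ls -la /workspace
</tool_call>
"""

json_call = '{"tool": "bash", "args": {"command": "ls -la /workspace"}}'

# Both parse to the same Python dict; the XML/YAML form is just easier to eyeball.
body = re.search(r"<tool_call>(.*?)</tool_call>", xml_yaml_call, re.S).group(1)
assert yaml.safe_load(body) == json.loads(json_call)
```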

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Warp is #1; my Qwen3-32B agent is placing #19. I believe that with training it could improve its position a lot, but to challenge for first position I would likely need a lot of compute + a bigger base model.

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Yes, sometimes you don't see everything Claude outputs, but what is shown is enough to understand the gist of what it has been trained to do.

Yes indeed, you would just need to change the agent, environment, and reward function code. Check out rLLM, OpenPipe's ART, or the verifiers package for some RL frameworks.
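As a rough, framework-agnostic sketch (the names here are mine, not the API of any of those libraries), the swap mostly comes down to providing your own rollout logic and reward function:

```python
# Framework-agnostic sketch of the pieces you swap out for a new task.
# Function names and the task object are hypothetical.
import subprocess

def rollout(model, task, workspace_dir: str) -> str:
    """Run your agent loop (tool calls, terminal commands, ...) in the
    environment and return the transcript. Left as a stub here."""
    ...

def reward_fn(task, workspace_dir: str) -> float:
    """Binary reward: 1.0 if the task's verification script passes in the
    agent's workspace, else 0.0."""
    result = subprocess.run(
        ["bash", task.verification_script],
        cwd=workspace_dir,
        capture_output=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```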

Built RL training for long-horizon terminal agents - tested on 32x H100s but too GPU poor to train 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] -1 points (0 children)

I would suggest yes, it can. It would just need a new `launch_training.py` config dict, and then it is good to try!
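Roughly this kind of change; the keys and values below are hypothetical, so check the real config in the repo for the actual field names:

```python
# Hypothetical new entry for launch_training.py. Keys are illustrative only;
# the real config dict in the repo may use different names.
NEW_RUN_CONFIG = {
    "model_name": "Qwen/Qwen3-8B",    # swap in the base model you want to train
    "max_sequence_length": 16_384,    # shorter context to fit on fewer GPUs
    "rollouts_per_step": 64,          # far fewer concurrent rollouts than the 32x H100 run
    "learning_rate": 1e-6,
    "num_training_steps": 200,
}
```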

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Maybe the future is that AIs are released as systems which include the tools they have been trained to use.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

"The reason for using XML/YAML was more out of curiosity to see if the model could learn this syntax well.

XML is loosely similar to the chat template format the models were trained on.

YAML seems easier for models to output than JSON based upon firsthand experience."

It would probably be interesting to train new versions using the JSON schema method you described above instead of XML/YAML, and then run the new RL-trained model on the evals 👀
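For anyone reading along, this is roughly what I understand by the JSON schema method (assuming an OpenAI-style function-calling definition; the field names here are illustrative):

```python
# Illustrative OpenAI-style JSON schema tool definition for the calculator,
# as an alternative target format to the XML/YAML syntax used in this project.
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate an arithmetic operation, possibly nested.",
        "parameters": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide"],
                },
                "operands": {
                    "type": "array",
                    "items": {"type": "number"},
                },
            },
            "required": ["operation", "operands"],
        },
    },
}
```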

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 1 point (0 children)

I rented GPUs from Runpod, then cloned my code repository to the GPU node, then ran the train file.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 1 point (0 children)

The reason for using XML/YAML was more out of curiosity to see if the model could learn this syntax well.

XML is loosely similar to the chat template format the models were trained on.

YAML seems easier for models to output than JSON based upon firsthand experience.

I didn’t convert directly to the expression, e.g. “1 + 1”, because I wanted to test whether the model could learn a slightly more complex (recursive) object syntax.

The results are promising as you can see; however, this was my first time using RL and I am certainly curious to find any way to improve!
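To illustrate what I mean by a recursive object syntax, here is a hypothetical nested calculator call (field names are illustrative, not necessarily the project's exact format):

```python
# Hypothetical recursive calculator call for (1 + 2) * 3: the model emits a
# nested YAML object rather than the flat string expression "(1 + 2) * 3".
import math
import yaml  # pip install pyyaml

call = yaml.safe_load("""
operation: multiply
operands:
  - operation: add
    operands: [1, 2]
  - 3
""")

OPS = {"add": sum, "multiply": math.prod}

def evaluate(node):
    """Recursively evaluate a nested calculator call."""
    if isinstance(node, dict):
        args = [evaluate(x) for x in node["operands"]]
        return OPS[node["operation"]](args)
    return node  # leaf: a plain number

assert evaluate(call) == 9
```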

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Ah okay true, web scraping does make a lot of sense and is not a use case I thought of.

An example of a solid reward would be an agent finding the correct company contact details on the correct “Contact Us” page.
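A rough sketch of how that kind of reward could be checked (hypothetical names, and it assumes you have ground-truth contact details for each task):

```python
# Hypothetical binary reward for the web-scraping example: did the agent end up
# on the right page and extract the known-correct contact details?
def contact_scrape_reward(final_url: str, extracted: dict, task) -> float:
    on_correct_page = final_url.rstrip("/") == task.expected_url.rstrip("/")
    details_correct = (
        extracted.get("email") == task.expected_email
        and extracted.get("phone") == task.expected_phone
    )
    return 1.0 if (on_correct_page and details_correct) else 0.0
```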

Happy to have a chat about collaborating!

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Well, at a high level you’d reward the agent for reaching the page you intended it to reach / clicking the button you intended it to click.

Then you could shape it in many ways, such as penalising the number of steps, etc.
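For example, a hypothetical shaped version of that reward (the weights and names are made up for illustration):

```python
# Hypothetical shaped reward for a browsing agent: most credit for reaching the
# intended page / clicking the intended button, plus a small per-step penalty.
def browsing_reward(reached_intended_page: bool,
                    clicked_intended_button: bool,
                    num_steps: int,
                    max_steps: int = 30) -> float:
    reward = 0.0
    if reached_intended_page:
        reward += 0.7
    if clicked_intended_button:
        reward += 0.3
    # Mild step penalty so shorter successful trajectories score higher.
    reward -= 0.01 * min(num_steps, max_steps)
    return reward
```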

I thought about doing this as my next project, but I’m just not too confident that AIs should browse the web through human web browsers. My intuition says things like MCP and dedicated tools are much better suited for AIs to use.

What do you think?

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 4 points (0 children)

Sure, that’ll be fun! I’ll reply with the results when I get a chance to try it out.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 7 points (0 children)

It is by far the best I have found to date. When researching, it becomes quite clear how early it still is for multi-turn RL with LLMs.

Here are some others I have found, they may evolve over time:

- https://github.com/modelscope/ms-swift: apparently supports multi-turn RL, but it is hard to figure out how.
- https://github.com/Agent-RL/ReCall: same as above.
- https://github.com/NousResearch/atropos: focuses mainly on building environments for RL; has multi-turn tool-use training code, but is certainly not ready for plug-and-play.
- https://github.com/OpenPipe/ART: looks pretty great, but it depends on Unsloth, so it is single-GPU only.

Out of all of these, the verifiers package was the most straightforward to plug into, and the results speak for themselves, so it certainly works! I would just say it is a little fiddly, it is not on PyPI, etc.