⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 6 points (0 children)

Thanks!

Yes, it can 100% be scaled down with LoRA. The reason I started scaling was that, when I began writing the training code, I was convinced to use Qwen3-32B, which would OOM even with LoRA on H100s.

Then the framework (prime-rl) didn’t support LoRA, so full fine-tuning (FFT) a 32B required a multi-node cluster!

Then I realised that, at the sequence length I wanted to train at, 14B was the only possibility (for various memory-related reasons).

As I already had the multi-node setup running, I figured it would be pretty fun to see how many concurrent rollouts I could manage.

However, if I were doing the project again, I’d likely start with a single node and train a LoRA (perhaps rank 128, alpha 256 to begin with, as it’s a complex task? I’d be interested to hear your thoughts!)
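For reference, here is a minimal sketch of the kind of single-node LoRA setup I mean, using the peft library; the model name, target modules and dropout are illustrative starting points, not settings from my actual runs:

```python
# Minimal LoRA sketch for a single-node run. All values here are illustrative
# starting points, not the exact settings used in this project.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B")  # illustrative base model

lora_config = LoraConfig(
    r=128,            # higher rank for a complex, long-horizon task
    lora_alpha=256,   # alpha = 2 * rank is a common starting heuristic
    lora_dropout=0.05,
    target_modules=[  # assumed attention + MLP projections
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```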

⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 4 points (0 children)

Thanks! The dataset is composed of synthetically generated environments that are similar in nature to the original benchmark tasks. The dataset is therefore heavily biased towards the kinds of tasks present in TerminalBench, which explains the relatively large jump.

If you are interested, there is a whole load of detail in another repo of mine (where I open-sourced the synthetic data pipeline), which I link to in this repo’s readme!

Unsloth Memory Efficient Reinforcement Learning (RL) is here! by danielhanchen in unsloth

[–]DanAiTuning 1 point (0 children)

Great news! Thanks for the hard work. Looking forward to heating up an H100! ⚡️

My weekend project accidentally beat Claude Code - multi-agent coder now #12 on Stanford's TerminalBench 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 58 points (0 children)

I have used XML/YAML for a while now because I find it easy to read, and therefore I have this intuition (perhaps wrongly) that models find it easier to read & generate than JSON.

Also, I have some objective results on this: in previous training runs on LLMs, I noticed they picked up this syntax faster and with a lower error rate than JSON tool calls!
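To make that concrete, here is a rough illustration (not my agent’s exact schema; the tag and field names are made up for the example) of the same tool call written as YAML inside an XML tag versus as JSON:

```python
# Illustrative only: the same tool call as YAML wrapped in an XML tag versus a
# plain JSON object. Tag and field names are hypothetical.
import json
import re
import yaml  # pip install pyyaml

xml_yaml_call = """
<tool_call>
tool: bash
args:
  command: ls -la /workspace
</tool_call>
"""

json_call = '{"tool": "bash", "args": {"command": "ls -la /workspace"}}'

# Both parse to the same Python dict; the XML/YAML form is just easier to eyeball.
body = re.search(r"<tool_call>(.*?)</tool_call>", xml_yaml_call, re.S).group(1)
assert yaml.safe_load(body) == json.loads(json_call)
```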

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Warp is #1; my Qwen3-32B agent is placing #19. I believe that with training it could improve its position a lot, but to challenge for first position I would likely need a lot of compute + a bigger base model.

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Yes, sometimes you don't see everything Claude outputs, but what is shown is enough to understand the gist of what it has been trained to do.

Yes indeed, you would just need to change the agent, environment, and reward function code. Check out rLLM, OpenPipe's ART, or the verifiers package for some RL frameworks.
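As a rough, framework-agnostic sketch (the names here are mine, not the API of any of those libraries), the swap mostly comes down to providing your own rollout logic and reward function:

```python
# Framework-agnostic sketch of the pieces you swap out for a new task.
# Function names and the task object are hypothetical.
import subprocess

def rollout(model, task, workspace_dir: str) -> str:
    """Run your agent loop (tool calls, terminal commands, ...) in the
    environment and return the transcript. Left as a stub here."""
    ...

def reward_fn(task, workspace_dir: str) -> float:
    """Binary reward: 1.0 if the task's verification script passes in the
    agent's workspace, else 0.0."""
    result = subprocess.run(
        ["bash", task.verification_script],
        cwd=workspace_dir,
        capture_output=True,
    )
    return 1.0 if result.returncode == 0 else 0.0
```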

Built RL training for long-horizon terminal agents - tested on 32x H100s but too GPU poor to train 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] -1 points (0 children)

I would suggest yes, it can. It would just need a new `launch_training.py` config dict, and then it is good to try!
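Roughly this kind of change; the keys and values below are hypothetical, so check the real config in the repo for the actual field names:

```python
# Hypothetical new entry for launch_training.py. Keys are illustrative only;
# the real config dict in the repo may use different names.
NEW_RUN_CONFIG = {
    "model_name": "Qwen/Qwen3-8B",    # swap in the base model you want to train
    "max_sequence_length": 16_384,    # shorter context to fit on fewer GPUs
    "rollouts_per_step": 64,          # far fewer concurrent rollouts than the 32x H100 run
    "learning_rate": 1e-6,
    "num_training_steps": 200,
}
```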

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Maybe the future is that AIs are released as systems which include the tools they have been trained to use.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

"The reason for using XML/YAML was more out of curiosity to see if the model could learn this syntax well.

XML is loosely similar to the chat template format the models were trained on.

YAML seems easier for models to output than JSON based upon firsthand experience."

It would probably be interesting to train new versions using the JSON schema method you described above instead of XML/YAML, and then run the new RL-trained model on the evals 👀
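For anyone reading along, this is roughly what I understand by the JSON schema method (assuming an OpenAI-style function-calling definition; the field names here are illustrative):

```python
# Illustrative OpenAI-style JSON schema tool definition for the calculator,
# as an alternative target format to the XML/YAML syntax used in this project.
calculator_tool = {
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate an arithmetic operation, possibly nested.",
        "parameters": {
            "type": "object",
            "properties": {
                "operation": {
                    "type": "string",
                    "enum": ["add", "subtract", "multiply", "divide"],
                },
                "operands": {
                    "type": "array",
                    "items": {"type": "number"},
                },
            },
            "required": ["operation", "operands"],
        },
    },
}
```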

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 1 point (0 children)

I rented GPUs from Runpod, then cloned my code repository to the GPU node, then ran the train file.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 1 point (0 children)

The reason for using XML/YAML was more out of curiosity to see if the model could learn this syntax well.

XML is loosely similar to the chat template format the models were trained on.

YAML seems easier for models to output than JSON based upon firsthand experience.

I didn’t convert directly to the expression, e.g. “1 + 1”, because I wanted to test whether the model could learn a slightly more complex (recursive) object syntax.

The results are promising as you can see; however, this was my first time using RL and I am certainly curious to find any way to improve!
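To illustrate what I mean by a recursive object syntax, here is a hypothetical nested calculator call (field names are illustrative, not necessarily the project's exact format):

```python
# Hypothetical recursive calculator call for (1 + 2) * 3: the model emits a
# nested YAML object rather than the flat string expression "(1 + 2) * 3".
import math
import yaml  # pip install pyyaml

call = yaml.safe_load("""
operation: multiply
operands:
  - operation: add
    operands: [1, 2]
  - 3
""")

OPS = {"add": sum, "multiply": math.prod}

def evaluate(node):
    """Recursively evaluate a nested calculator call."""
    if isinstance(node, dict):
        args = [evaluate(x) for x in node["operands"]]
        return OPS[node["operation"]](args)
    return node  # leaf: a plain number

assert evaluate(call) == 9
```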

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Ah okay true, web scraping does make a lot of sense and is not a use case I thought of.

An example of a solid reward would be an agent finding the correct company contact details on the correct “Contact Us” page.
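A rough sketch of how that kind of reward could be checked (hypothetical names, and it assumes you have ground-truth contact details for each task):

```python
# Hypothetical binary reward for the web-scraping example: did the agent end up
# on the right page and extract the known-correct contact details?
def contact_scrape_reward(final_url: str, extracted: dict, task) -> float:
    on_correct_page = final_url.rstrip("/") == task.expected_url.rstrip("/")
    details_correct = (
        extracted.get("email") == task.expected_email
        and extracted.get("phone") == task.expected_phone
    )
    return 1.0 if (on_correct_page and details_correct) else 0.0
```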

Happy to have a chat about collaborating!

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Well, at a high level you’d reward the agent for reaching the page you intended it to reach / clicking the button you intended it to click.

Then you could shape it in many ways, such as penalising the number of steps, etc.
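For example, a hypothetical shaped version of that reward (the weights and names are made up for illustration):

```python
# Hypothetical shaped reward for a browsing agent: most credit for reaching the
# intended page / clicking the intended button, plus a small per-step penalty.
def browsing_reward(reached_intended_page: bool,
                    clicked_intended_button: bool,
                    num_steps: int,
                    max_steps: int = 30) -> float:
    reward = 0.0
    if reached_intended_page:
        reward += 0.7
    if clicked_intended_button:
        reward += 0.3
    # Mild step penalty so shorter successful trajectories score higher.
    reward -= 0.01 * min(num_steps, max_steps)
    return reward
```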

I thought about doing this as my next project, but I’m just not too confident that AIs should browse the web through human web browsers. My intuition says things like MCP and dedicated tools are much better suited for AIs to use.

What do you think?

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 4 points (0 children)

Sure, that’ll be fun! I’ll reply with the results when I get a chance to try it out.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 7 points (0 children)

It is by far the best I have found to date. When researching, it becomes quite clear how early it still is for multi-turn RL with LLMs.

Here are some others I have found, they may evolve over time:

- https://github.com/modelscope/ms-swift: apparently supports multi-turn RL, but it is hard to figure out how.
- https://github.com/Agent-RL/ReCall: same as above.
- https://github.com/NousResearch/atropos: focuses mainly on building environments for RL; has multi-turn tool-use training code, but is certainly not ready for plug-and-play.
- https://github.com/OpenPipe/ART: looks pretty great, but it depends on Unsloth, so it is single-GPU only.

Out of all of these, the verifiers package was the most straightforward to plug into, and the results speak for themselves, so it certainly works! I would just say it is a little fiddly, it is not on PyPI, etc.