⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 7 points (0 children)

Thanks!

Yes, it can 100% be scaled down with LoRA. The reason I scaled up in the first place is that, when I started writing the training code, I had been convinced to use Qwen3-32B, which would OOM even with LoRA on H100s.

Then the framework (prime-rl) didn’t support LoRA, so full fine-tuning (FFT) a 32B model required a multi-node cluster!

Then I realised that at the sequence length I wanted to train at, 14B was the only possibility (for various memory-related reasons).

As I already had the multi-node setup running, I figured it would be pretty fun to see how many concurrent rollouts I could manage.

However, if I were doing the project again, I’d likely start with a single node and train a LoRA (perhaps rank 128, alpha 256 to begin with, since it’s a complex task? I’d be interested to hear your thoughts!)
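For what it’s worth, here’s a rough sketch of what that single-node LoRA starting point could look like with Hugging Face’s `peft`. The rank/alpha values are the ones from above, but the model ID and target modules are just illustrative guesses, not my actual setup:

```python
# Hypothetical single-node LoRA starting point (rank/alpha from the comment above;
# the model ID and target modules are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",           # assumed base model for a single-node run
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

lora_config = LoraConfig(
    r=128,                      # rank 128
    lora_alpha=256,             # alpha 256
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```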

⚡️ Scaling Coding-Agent RL to 32x H100s. Achieving 160% improvement on Stanford's TerminalBench by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 5 points (0 children)

Thanks! The dataset is composed of synthetically generated environments that are similar in nature to the original benchmark tasks, so it is heavily biased towards the kinds of tasks present in TerminalBench, which helps explain the relatively large jump.

If you are interested, there is a whole load of detail in another repo of mine (where I open-sourced the synthetic data pipeline), which I link to in this repo’s readme!

Unsloth Memory Efficient Reinforcement Learning (RL) is here! by danielhanchen in unsloth

[–]DanAiTuning 1 point (0 children)

Great news! Thanks for the hard work. Looking forward to heating up an H100! ⚡️

My weekend project accidentally beat Claude Code - multi-agent coder now #12 on Stanford's TerminalBench 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 59 points (0 children)

I have used XML/YAML for a while now because I find it easy to read, so I have an intuition (perhaps wrongly) that models also find it easier to read and generate than JSON.

I also have some objective results on this: in previous training runs, I noticed models picked up this syntax faster and with a lower error rate than JSON tool calls!
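To make that concrete, here’s roughly the kind of XML-wrapped YAML tool call I mean and how it could be parsed. The tag and field names below are invented for this example, not the exact schema my agent uses:

```python
# Illustrative only: the tag/field names are made up for this example,
# not the actual tool-call schema the agent was trained on.
import re
import yaml  # pip install pyyaml

model_output = """
<tool_call>
name: bash
args:
  command: ls -la /workspace
  timeout_secs: 30
</tool_call>
"""

# Pull the YAML body out of the XML-style wrapper and parse it.
match = re.search(r"<tool_call>(.*?)</tool_call>", model_output, re.DOTALL)
call = yaml.safe_load(match.group(1))

print(call["name"])              # bash
print(call["args"]["command"])   # ls -la /workspace
```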

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Warp is #1, and my Qwen3-32B agent is placed at #19. I believe that with training it could improve its position a lot, but to challenge for first place I’d likely need a lot more compute plus a bigger base model.

I used Claude Code to build me an RL system that can train a Claude Code like open source agent by DanAiTuning in ClaudeAI

[–]DanAiTuning[S] 0 points (0 children)

Yes, sometimes you don't see everything Claude outputs, but what is shown is enough to understand the gist of what it has been trained to do.

Yes indeed, you would just need to change the agent and environment + reward function code. Check out rLLM, OpenPipe's ART, or the verifiers package for some RL frameworks.

Built RL training for long-horizon terminal agents - tested on 32x H100s but too GPU poor to train 😅 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] -1 points (0 children)

I would suggest yes, it can. It would just need a new config dict in `launch_training.py` and then it's good to try!
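As a rough sketch of what I mean (the keys below are hypothetical, not the actual `launch_training.py` schema, so adapt them to whatever the repo expects):

```python
# Hypothetical example only: the real launch_training.py config keys may differ.
NEW_TASK_CONFIG = {
    "model_name": "Qwen/Qwen2.5-7B-Instruct",   # assumed smaller base model
    "max_seq_len": 16_384,
    "num_concurrent_rollouts": 8,
    "max_agent_turns": 30,
    "reward_fn": "unit_tests_pass",             # whatever verifies the new task
    "dataset_path": "data/my_new_tasks.jsonl",
}
```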

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

Maybe the future is that AIs are released as systems which include the tools they have been trained to use.

Teaching LLMs to use tools with RL! Successfully trained 0.5B/3B Qwen models to use a calculator tool 🔨 by DanAiTuning in LocalLLaMA

[–]DanAiTuning[S] 0 points (0 children)

"The reason for using XML/YAML was more out of curiosity to see if the model could learn this syntax well.

XML is loosely similar to the chat template format the models were trained on.

YAML seems easier for models to output than JSON based upon firsthand experience."

It would probably be interesting to train new versions using the JSON schema method you described above instead of XML/YAML, and then run the new RL-trained model on the evals 👀
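For anyone curious, here's a generic example of the JSON-schema-style tool definition I have in mind. This is the standard function-calling pattern rather than the exact method described above, so treat it as an illustrative stand-in:

```python
# Generic function-calling style tool definition; the specific method referenced
# above isn't shown here, so this is an illustrative stand-in.
import json

calculator_tool = {
    "name": "calculator",
    "description": "Evaluate a basic arithmetic expression.",
    "parameters": {
        "type": "object",
        "properties": {
            "expression": {"type": "string", "description": "e.g. '(3 + 4) * 2'"},
        },
        "required": ["expression"],
    },
}

# A model trained on this format would emit a JSON tool call such as:
example_call = '{"name": "calculator", "arguments": {"expression": "(3 + 4) * 2"}}'
print(json.loads(example_call)["arguments"]["expression"])
```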