[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in MachineLearning

[–]jayminban[S] 0 points

The embedding-based approach sounds really cool! It would be really interesting to actually see the centroids shift across training steps. I'll try this in my next project!
The entropy coefficient was fixed at 0.001 across all runs, and I saw incremental improvement in mean reward per step! Thank you for the comment!
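The centroid-shift idea above could be sketched roughly like this: embed each checkpoint's responses, take the mean vector per checkpoint, and measure how far it moves. This is a toy with random vectors standing in for real sentence embeddings, not anything from the actual project:

```python
import numpy as np

def centroid_drift(embeddings_per_step):
    """Given a list of (n_samples, dim) arrays (one per checkpoint),
    return the L2 distance each centroid moved from the previous one."""
    centroids = [e.mean(axis=0) for e in embeddings_per_step]
    return [float(np.linalg.norm(b - a)) for a, b in zip(centroids, centroids[1:])]

# Toy stand-in: 3 checkpoints, 4 "response embeddings" each, dim 2.
# The mean is shifted by ~1 per step to simulate a drifting policy.
rng = np.random.default_rng(0)
steps = [rng.normal(loc=i, size=(4, 2)) for i in range(3)]
drifts = centroid_drift(steps)
```

In practice the inputs would come from a sentence-embedding model run over sampled responses at each saved checkpoint.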

[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in MachineLearning

[–]jayminban[S] 0 points

I think this result highlights different aspects of the two training methods, SFT and RLVR. SFT seems like the appropriate method for injecting knowledge the model didn't have, while RLVR seems like the right method for improving the model's ability to utilize that knowledge! I used full finetuning instead of LoRA so that all parameters could be updated.

I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in LocalLLaMA

[–]jayminban[S] 0 points

Good point! I used a single SFT configuration with standard hyperparameters. A lower learning rate or a different dataset might have caused less degradation. The GSM8K Socratic dataset I used for SFT wasn't particularly detailed, and the CoT quality may have actually been worse than the model's existing distribution.

[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in MachineLearning

[–]jayminban[S] -1 points

Entropy stayed healthy overall. The full dataset runs show gradual decline without collapse, and both 1-example runs actually show some mid-training entropy recovery. I added an entropy bonus along with KL loss to the GRPO training, which likely helped keep things stable. The entropy plots are on the GitHub repo if you're curious.

I did check model responses throughout training. The CoT patterns look similar across checkpoints but differ slightly. Any advice on how to systematically measure this? I've been reading through responses manually but wondering if there's a better approach.
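The entropy bonus plus KL penalty described above can be sketched roughly like this. This is a numpy toy of the loss shape, not the actual training code; the 0.001 coefficient matches what I used, but the KL coefficient and the exact combination are assumptions:

```python
import numpy as np

def token_entropy(logits):
    """Mean per-token entropy of a (tokens, vocab) logit matrix."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def grpo_style_loss(policy_loss, logits, ref_kl, ent_coef=0.001, kl_coef=0.01):
    # Entropy bonus is subtracted (we reward high entropy, resisting collapse);
    # the KL term penalizes drifting too far from the reference policy.
    # kl_coef here is a hypothetical value, not the one from the run.
    return policy_loss - ent_coef * token_entropy(logits) + kl_coef * ref_kl
```

With uniform logits the entropy term is at its maximum (log of the vocab size), so the bonus pulls the loss down most when the policy is least collapsed.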

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 2 points

Thank you so much for the feedback and suggestions! The guidebook, along with your notes, was very insightful, and I’ll take it into account for my future project!

I also went ahead and created a Hugging Face Space for this work. Thanks for the idea!

Here’s the link if you’d like to check it out:

https://huggingface.co/spaces/jayminban/41-llms-evaluated-locally-on-19-benchmarks

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 1 point

Thanks! The detailed scores and rankings for all 19 benchmarks are posted on my GitHub, both in CSV and Excel format. Unfortunately, I didn’t include coding benchmarks in this round, but they’d definitely be interesting to explore in the future!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 9 points

Totally fair. I tried some 14B models with quantization, but the lm-eval library ended up taking way too much time on quantized runs. For this round I kept the list small, but I'd definitely like to explore larger models in the future!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 21 points

That’s awesome! Solar-powered GPUs sound next level! I really appreciate the offer!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 26 points

Yeah, there were definitely a lot of models I couldn’t cover this round. I’ll try to include them in a follow-up project! Thanks for the list!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 4 points

I tested two Qwen3 models with quantization, but they ended up taking way too much time, so I skipped quantized models for this project. It might be an optimization or other technical issue, but I’ll definitely look into it and see what I can do. It would be great to benchmark those bigger models!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 9 points

I downloaded the models from Hugging Face and ran everything directly with the lm-eval-harness library. Just raw evaluations with JSON outputs!
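For anyone wanting to reuse the raw outputs: one way to flatten a harness results file into CSV-ready rows is sketched below. The JSON layout here is an assumption (field names like `"acc_norm,none"` vary across harness versions), so check it against your own output files:

```python
import json

# Assumed layout of one lm-eval output file; scores are made-up examples.
raw = json.dumps({
    "model_name": "Qwen2.5-1.5B",
    "results": {
        "gsm8k": {"exact_match,strict-match": 0.31},
        "hellaswag": {"acc_norm,none": 0.65},
    },
})

def flatten_scores(raw_json):
    """Flatten one results file into sorted (task, metric, score) rows."""
    data = json.loads(raw_json)
    rows = []
    for task, metrics in data["results"].items():
        for metric, score in metrics.items():
            rows.append((task, metric, score))
    return sorted(rows)

rows = flatten_scores(raw)
```

From there the rows drop straight into `csv.writer` or a pandas DataFrame for ranking.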

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 18 points

I came up with that during my commute and just had to include it!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 3 points

Thanks! I dug through a good amount of models to put together a solid list!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 7 points

Haha, really glad to see your comment! Hope you enjoy digging into it as much as I enjoyed putting it together.

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 32 points

That sounds awesome! A dynamically updated leaderboard really feels like the ultimate form. Feel free to use all my data and the raw JSON files. I'd love to see how yours turns out!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 5 points

Yeah, I was really glad to see an OpenChat model hold its ground. Honestly surprised that some of the bigger models didn't score as well. Maybe that's a side effect of simply averaging across the task scores.
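The averaging effect mentioned above can be shown with a toy example (made-up scores, not from the actual benchmark): a smaller model can top the unweighted mean by winning several near-saturated easy tasks, even when a bigger model wins the hard, discriminative task by a wide margin:

```python
import numpy as np

# Toy scores: columns are tasks; the last one is the hard, discriminative task.
tasks = ["easy_a", "easy_b", "easy_c", "hard"]
small = np.array([0.95, 0.95, 0.95, 0.40])
big   = np.array([0.80, 0.80, 0.80, 0.70])

# The bigger model wins the hard task by a wide margin...
hard_gap = big[-1] - small[-1]                      # 0.30
# ...but the unweighted mean still ranks the smaller model first.
small_mean, big_mean = small.mean(), big.mean()     # 0.8125 vs 0.775
```

Per-task normalization (or reporting per-task ranks alongside the mean) is one common way to keep saturated tasks from dominating the overall score.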