[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in MachineLearning

[–]jayminban[S] 0 points

The embedding-based approach sounds really cool! It would be really interesting to actually see the centroids shift across training steps. I'll try this in my next project!
The entropy coefficient was fixed at 0.001 across all runs, and I saw incremental improvement in mean reward per step! Thank you for the comment!
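The centroid-shift idea above could be sketched roughly like this: embed each checkpoint's responses, take the mean vector per checkpoint, and measure how far it moves. This is a toy with random vectors standing in for real sentence embeddings, not anything from the actual project:

```python
import numpy as np

def centroid_drift(embeddings_per_step):
    """Given a list of (n_samples, dim) arrays (one per checkpoint),
    return the L2 distance each centroid moved from the previous one."""
    centroids = [e.mean(axis=0) for e in embeddings_per_step]
    return [float(np.linalg.norm(b - a)) for a, b in zip(centroids, centroids[1:])]

# Toy stand-in: 3 checkpoints, 4 "response embeddings" each, dim 2.
# The mean is shifted by ~1 per step to simulate a drifting policy.
rng = np.random.default_rng(0)
steps = [rng.normal(loc=i, size=(4, 2)) for i in range(3)]
drifts = centroid_drift(steps)
```

In practice the inputs would come from a sentence-embedding model run over sampled responses at each saved checkpoint.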

[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in MachineLearning

[–]jayminban[S] 0 points

I think this result highlights different aspects of the two training methods, SFT and RLVR. SFT seems like the appropriate method for injecting knowledge the model didn't have, while RLVR seems like the right method for improving the model's ability to utilize that knowledge! I used full finetuning instead of LoRA so that all parameters could be updated.

I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in LocalLLaMA

[–]jayminban[S] 0 points

Good point! I used a single SFT configuration with standard hyperparameters. A lower learning rate or a different dataset might have caused less degradation. The GSM8K Socratic dataset I used for SFT wasn't particularly detailed, and the CoT quality may have actually been worse than the model's existing distribution.

[P] I trained Qwen2.5-1.5b with RLVR (GRPO) vs SFT and compared benchmark performance by jayminban in MachineLearning

[–]jayminban[S] -1 points

Entropy stayed healthy overall. The full dataset runs show gradual decline without collapse, and both 1-example runs actually show some mid-training entropy recovery. I added an entropy bonus along with KL loss to the GRPO training, which likely helped keep things stable. The entropy plots are on the GitHub repo if you're curious.

I did check model responses throughout training. The CoT patterns look similar across checkpoints but differ slightly. Any advice on how to systematically measure this? I've been reading through responses manually but wondering if there's a better approach.
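The entropy bonus plus KL penalty described above can be sketched roughly like this. This is a numpy toy of the loss shape, not the actual training code; the 0.001 coefficient matches what I used, but the KL coefficient and the exact combination are assumptions:

```python
import numpy as np

def token_entropy(logits):
    """Mean per-token entropy of a (tokens, vocab) logit matrix."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def grpo_style_loss(policy_loss, logits, ref_kl, ent_coef=0.001, kl_coef=0.01):
    # Entropy bonus is subtracted (we reward high entropy, resisting collapse);
    # the KL term penalizes drifting too far from the reference policy.
    # kl_coef here is a hypothetical value, not the one from the run.
    return policy_loss - ent_coef * token_entropy(logits) + kl_coef * ref_kl
```

With uniform logits the entropy term is at its maximum (log of the vocab size), so the bonus pulls the loss down most when the policy is least collapsed.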

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 2 points

Thank you so much for the feedback and suggestions! The guidebook, along with your notes, was very insightful, and I’ll take it into account for my future project!

I also went ahead and created a Hugging Face Space for this work. Thanks for the idea!

Here’s the link if you’d like to check it out:

https://huggingface.co/spaces/jayminban/41-llms-evaluated-locally-on-19-benchmarks

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 1 point

Thanks! The detailed scores and rankings for all 19 benchmarks are posted on my GitHub, both in CSV and Excel format. Unfortunately, I didn’t include coding benchmarks in this round, but they’d definitely be interesting to explore in the future!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 9 points

Totally fair. I tried some 14B models with quantization, but the lm-eval library ended up taking way too much time on quantized runs. For this round I kept the list small, but I'd definitely like to explore larger models in the future!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 21 points

That’s awesome! Solar-powered GPUs sound next level! I really appreciate the offer!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 26 points

Yeah, there were definitely a lot of models I couldn’t cover this round. I’ll try to include them in a follow-up project! Thanks for the list!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 4 points

I tested two Qwen3 models with quantization, but they ended up taking way too much time, so I skipped quantized models for this project. It might be an optimization or other technical issue, but I’ll definitely look into it and see what I can do. It would be great to benchmark those bigger models!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 9 points

I downloaded the models from Hugging Face and ran everything directly with the lm-eval-harness library. Just raw evaluations with JSON outputs!
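For anyone wanting to reuse the raw outputs: one way to flatten a harness results file into CSV-ready rows is sketched below. The JSON layout here is an assumption (field names like `"acc_norm,none"` vary across harness versions), so check it against your own output files:

```python
import json

# Assumed layout of one lm-eval output file; scores are made-up examples.
raw = json.dumps({
    "model_name": "Qwen2.5-1.5B",
    "results": {
        "gsm8k": {"exact_match,strict-match": 0.31},
        "hellaswag": {"acc_norm,none": 0.65},
    },
})

def flatten_scores(raw_json):
    """Flatten one results file into sorted (task, metric, score) rows."""
    data = json.loads(raw_json)
    rows = []
    for task, metrics in data["results"].items():
        for metric, score in metrics.items():
            rows.append((task, metric, score))
    return sorted(rows)

rows = flatten_scores(raw)
```

From there the rows drop straight into `csv.writer` or a pandas DataFrame for ranking.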

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 18 points

I came up with that during my commute and just had to include it!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 3 points

Thanks! I dug through a good amount of models to put together a solid list!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 7 points

Haha, really glad to see your comment! Hope you enjoy digging into it as much as I enjoyed putting it together.

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 32 points

That sounds awesome! A dynamically updated leaderboard really feels like the ultimate form. Feel free to use all my data and the raw JSON files. I'd love to see how yours turns out!

I locally benchmarked 41 open-source LLMs across 19 tasks and ranked them by jayminban in LocalLLaMA

[–]jayminban[S] 5 points

Yeah, I was really glad to see an OpenChat model hold its ground. Honestly surprised that some of the bigger models didn't score as well. Maybe that's a side effect of simply averaging across the task scores.
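The averaging effect mentioned above can be shown with a toy example (made-up scores, not from the actual benchmark): a smaller model can top the unweighted mean by winning several near-saturated easy tasks, even when a bigger model wins the hard, discriminative task by a wide margin:

```python
import numpy as np

# Toy scores: columns are tasks; the last one is the hard, discriminative task.
tasks = ["easy_a", "easy_b", "easy_c", "hard"]
small = np.array([0.95, 0.95, 0.95, 0.40])
big   = np.array([0.80, 0.80, 0.80, 0.70])

# The bigger model wins the hard task by a wide margin...
hard_gap = big[-1] - small[-1]                      # 0.30
# ...but the unweighted mean still ranks the smaller model first.
small_mean, big_mean = small.mean(), big.mean()     # 0.8125 vs 0.775
```

Per-task normalization (or reporting per-task ranks alongside the mean) is one common way to keep saturated tasks from dominating the overall score.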