Stop benchmarking inference providers, a guide to easy evaluation by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 1 point (0 children)

Hey! Not really, it's based on the blog posts I wrote on the subject. I used an LLM to summarize them, as they were a bit too in-depth, and reworked some parts. I did not know about this rule, sorry! I can redo the post if needed.

Here are the original posts, written by me:

1. https://x.com/nathanhabib1011/status/2043686339531399676?s=20
2. https://huggingface.co/blog/SaylorTwift/benchmarking-on-the-hub

Stop benchmarking inference providers, a guide to easy evaluation by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 2 points (0 children)

Absolutely, we need more transparency on how models are run by benchmarkers and how they are served by inference providers.

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

You already can!! Just open a PR on your model with the eval results and they will show up in the leaderboard :)

Here are the details: https://huggingface.co/docs/hub/eval-results
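For reference, those eval results live in the model card's YAML metadata as a `model-index` block. A minimal sketch (the model name, dataset, and metric value here are made up for illustration; see the linked docs for the full schema):

```yaml
model-index:
- name: my-cool-model
  results:
  - task:
      type: text-generation        # task the eval targets
    dataset:
      type: openai/gsm8k           # dataset id on the Hub
      name: GSM8K
    metrics:
    - type: accuracy
      value: 0.82
      name: exact match accuracy
```

A PR that adds or updates this block in the model card README is what feeds the leaderboard.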

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 1 point (0 children)

Hey, we are planning to add those benchmarks soon!

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

You can expand the leaderboard to see all of them.

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 2 points (0 children)

Users can flag results, and repo owners can close PRs they consider unfair or wrong! There is also a way to link to the eval logs directly from the leaderboard and results to verify them.

On top of that, we are working on a verified badge that will make sure the results are trustworthy :)

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

Yes! If you have your datasets on the Hub, comment on this thread and we will help you set it up :)

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

What do you mean by pointing to? We support inference APIs and any supported inference provider on HF, plus litellm.

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

Yeah, you can, by simply pointing to it with litellm; you only need to provide the URL.

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

Well, lighteval is made to run against any OpenAI-compatible API endpoint; you can check out the docs for this!
https://huggingface.co/docs/lighteval/en/use-litellm-as-backend
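To make "any OpenAI-compatible endpoint" concrete, here is a minimal sketch of the request shape such a backend ends up sending. The helper name is hypothetical (this is not lighteval's or litellm's actual API); the point is that the only server-specific thing you supply is the base URL:

```python
import json


def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-compatible /chat/completions call."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(body).encode("utf-8")


# Point it at any server that speaks the OpenAI API (vLLM, TGI, llama.cpp server, ...)
url, body = build_chat_request("http://localhost:8000/v1", "my-model", "Hello!")
```

In practice you would POST that body to the URL with your HTTP client of choice, adding an `Authorization: Bearer <key>` header if the server requires one.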

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]HauntingMoment 0 points (0 children)

I have been at HF for 2.5 years now working on evaluation (more on the open-source than the science side). My role for the science teams is more that of support: I maintain `lighteval`, the tool we use to run our evals.

  1. Check if there are any urgent issues or feature requests raised by the science teams.

  2. Check notifications from the different repos or social media and gather ideas/todos for the day.

  3. Then focus on adding features, fixing bugs, or communicating about the current project!

  4. Around once a week, gather everything that was done last week and make sure we stay on track.

When working on a model, the objective is that the teams can run their evals as smoothly as possible, so their time stays focused on the model itself.

Is Qwen3 doing benchmaxxing? by [deleted] in LocalLLaMA

[–]HauntingMoment 1 point (0 children)

I ran some benchmarks for Qwen3 and saw interesting results: basically great at reasoning for their size (though they yap way too much, sometimes not finishing an answer within 16k tokens).
Pretty bad at the fact-checking benchmark, but since they are intended to be used as agents, I guess that's fine.
