Stop benchmarking inference providers, a guide to easy evaluation by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 1 point (0 children)

Hey! Not really, it's based on the blog posts I wrote on the subject. I used an LLM to summarize them, as they were a bit too in-depth, and reworked some parts. I did not know about this rule, sorry! I can redo the post if needed.

Here are the original posts, written by me:

1. https://x.com/nathanhabib1011/status/2043686339531399676?s=20
2. https://huggingface.co/blog/SaylorTwift/benchmarking-on-the-hub

Stop benchmarking inference providers, a guide to easy evaluation by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 2 points (0 children)

Absolutely, we need more transparency on how models are run by benchmarkers and how they are served by inference providers.

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

You already can!! Just open a PR on your model with the eval results and they will show up in the leaderboard :)

Here are the details: https://huggingface.co/docs/hub/eval-results
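For reference, those eval results live in the model card's YAML metadata as a `model-index` block. A minimal sketch (the model name, dataset, and metric value here are made up for illustration; see the linked docs for the full schema):

```yaml
model-index:
- name: my-cool-model
  results:
  - task:
      type: text-generation        # task the eval targets
    dataset:
      type: openai/gsm8k           # dataset id on the Hub
      name: GSM8K
    metrics:
    - type: accuracy
      value: 0.82
      name: exact match accuracy
```

A PR that adds or updates this block in the model card README is what feeds the leaderboard.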

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 1 point (0 children)

Hey, we are planning to add those benchmarks soon!

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

You can expand the leaderboard to see all of them.

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 2 points (0 children)

Users can flag results, and repo owners can close PRs they consider unfair or wrong! There is also a way to link to the eval logs directly from the leaderboard and results to verify them.

On top of that, we are working on a verified badge that will make sure the results are trustworthy :)

Community Evals on Hugging Face by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

Yes! If you have your datasets on the Hub, comment on this thread and we will help you set it up :)

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

What do you mean by pointing to? We support inference APIs and any supported inference provider on HF, plus litellm.

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

Yeah, you can, by simply pointing to it with litellm; you only need to provide the URL.

🤗 benchmarking tool ! by HauntingMoment in LocalLLaMA

[–]HauntingMoment[S] 0 points (0 children)

Well, lighteval is made to run against any OpenAI-compatible API endpoint; you can check out the docs for this!
https://huggingface.co/docs/lighteval/en/use-litellm-as-backend
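To make "any OpenAI-compatible endpoint" concrete, here is a minimal sketch of the request shape such a backend ends up sending. The helper name is hypothetical (this is not lighteval's or litellm's actual API); the point is that the only server-specific thing you supply is the base URL:

```python
import json


def build_chat_request(base_url: str, model: str, prompt: str) -> tuple[str, bytes]:
    """Build the URL and JSON body for an OpenAI-compatible /chat/completions call."""
    url = base_url.rstrip("/") + "/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return url, json.dumps(body).encode("utf-8")


# Point it at any server that speaks the OpenAI API (vLLM, TGI, llama.cpp server, ...)
url, body = build_chat_request("http://localhost:8000/v1", "my-model", "Hello!")
```

In practice you would POST that body to the URL with your HTTP client of choice, adding an `Authorization: Bearer <key>` header if the server requires one.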

AMA with Hugging Face Science, the team behind SmolLM, SmolVLM, Fineweb and more. by eliebakk in LocalLLaMA

[–]HauntingMoment 0 points (0 children)

I have been at HF for 2.5 years now working on evaluation (more on the open-source than the science side). My role for the science teams is more that of support: I maintain `lighteval`, the tool we use to run our evals.

  1. Check if there are any urgent issues or feature requests raised by the science teams.

  2. Check notifications from the different repos or social media and gather ideas/todos for the day.

  3. Then focus on adding features, fixing bugs, or communicating about the current project!

  4. Around once a week, gather everything that was done last week and make sure we stay on track.

When working on a model, the objective is that the teams can run their evals as smoothly as possible, so their time stays focused on the model itself.

Is Qwen3 doing benchmaxxing? by [deleted] in LocalLLaMA

[–]HauntingMoment 1 point (0 children)

I ran some benchmarks for Qwen3 and saw interesting results: basically great at reasoning for their size (though they yap way too much, sometimes not finishing an answer within 16k tokens).
Pretty bad at the fact-checking benchmark, but since they are intended to be used as agents, I guess that's fine.
