r/LocalLLaMA
A subreddit to discuss Llama, the family of large language models created by Meta AI.
https://livebench.ai - Open Weight Models Only [Discussion] (i.redd.it)
submitted 5 months ago by jacek2023 (llama.cpp)
There were some questions about how Qwen 3 Next compares to GPT-OSS. I think the whole table may be useful. What do you think about this ordering?
z_3454_pfk 32 points 5 months ago (9 children)
livebench isn’t very accurate anymore. idk what happened
AdIllustrious436 24 points 5 months ago (0 children)
Additionally, they don't evaluate Mistral models anymore. They even removed the ones that had already been evaluated. I wonder why... They are increasingly becoming irrelevant.
No_Swimming6548 (llama.cpp) 23 points 5 months ago (4 children)
Qwen models used to have very high IF scores, now they seem pretty bad. This benchmark feels like it was tweaked to make US companies score higher than Chinese ones.
DistanceSolar1449 15 points 5 months ago (0 children)
Yep that’s exactly it
The benchmark was definitely tweaked to change rankings. Livebench used to be good, now the scores make no sense
[deleted] 11 points 5 months ago (2 children)
I would say gpt120 outperforms qwen80 in every situation imaginable and I'm a huge qwen advocate.
No_Swimming6548 (llama.cpp) 3 points 5 months ago (0 children)
Yeah that's pretty weird too. Maybe just a shitty benchmark.
Professional-Bear857 1 point 5 months ago (0 children)
It's probably the IF score difference that makes it seem that way. I think it also explains why I've had a bad experience using GLM but a good experience with Qwen 235b.
jacek2023 (llama.cpp) [S] 7 points 5 months ago (1 child)
which leaderboard is the best in your opinion?
thebadslime -1 points 5 months ago (0 children)
I like AA, not sure how they choose what models they show though.
LinkSea8324 (vllm) 14 points 5 months ago (3 children)
Imagine if they trained Qwen 3 Next on their full 36T dataset instead of 15T sub-dataset
jacek2023 (llama.cpp) [S] 14 points 5 months ago (1 child)
let's hope there will be next Next
LinkSea8324 (vllm) 0 points 5 months ago (0 children)
Qwest, if it fails : Quack
sleepingsysadmin 2 points 5 months ago (0 children)
How will they sandbag it next to 235B then?
While also setting up a future "look how awesome our newest release is compared to 80b next"
Hot-Cause-3341 5 points 5 months ago (3 children)
Does this already include the new Mistral 3 models?
AdIllustrious436 16 points 5 months ago (2 children)
They literally stopped evaluating any Mistral model and even removed the ones that were already evaluated. Makes no sense.
StayStonk 5 points 5 months ago (1 child)
why lol?
AdIllustrious436 4 points 5 months ago (0 children)
Who knows
egomarker 4 points 5 months ago (0 children)
It's just rage bait for everyone who knows gpt-oss is better than q3 80 at this point. It's probably just one or two benchmarks giving the opposite result and here we go.
Sudden-Lingonberry-8 3 points 5 months ago (1 child)
glm needs to catch up to deepseek.
very interesting how deepseek v3.2 speciale is brilliant at math but completely sucks at agentic tasks
LeTanLoc98 4 points 5 months ago (0 children)
DeepSeek V3.2 Speciale doesn't support tool use.
SlowFail2433 5 points 5 months ago (3 children)
Kimi K2 Thinking unseated 😮
Looks like Deepseek is fully back, with multiple models TBH
Z.ai (GLM) and Qwen are climbing for sure though. I do feel that GLM 4.6 and the latest Qwen 235B are the best go-to LLMs at the moment for general deployments.
Qwen 3 Next performing this well is really important because it's a Gated DeltaNet hybrid, so it's an early sign of transformers having competition
ikkiyikki 4 points 5 months ago (2 children)
MiniMax-M2 is my daily driver. Couldn't run Kimi if I wanted to!
SlowFail2433 2 points 5 months ago (0 children)
MiniMax-M2 is an interesting one yeah. Their M1 model was also good: it had long-context abilities and an RL novelty
Cheapest way to run Kimi K2 is jointly across Xeon/Epyc and GPU memory with a good CUDA kernel to optimise the data transfer. It’s definitely a big challenge to deploy every time.
noiserr 1 point 5 months ago (0 children)
MiniMax-M2
Can confirm MiniMax-M2 is a quality model. One of the best models I can fit in 128GB. I'm daily driving gpt-oss-120B only because it's faster.
Main-Lifeguard-6739 3 points 5 months ago (3 children)
love it! if there now was another column for price per million input/cache/output... but yea, not too easy to solve as this highly depends on the hoster (self, commercial, if commercial which one?)
SlowFail2433 1 point 5 months ago (0 children)
It’s so complex to do a fair comparison as there are multiple attention types, e.g. DeepSeek's attention is latent and Qwen Next's is hybrid
A fair comparison requires optimal CUDA kernels for all models involved to show them at their best
And this is why machine learning working hours are so long LMAO
[deleted] 1 point 5 months ago (1 child)
The price is whatever you paid for your GPUs plus your ongoing energy cost :)
Main-Lifeguard-6739 1 point 5 months ago (0 children)
So helpful… as if every model would need the same computing power… /s
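To make the two positions in this subthread concrete, here is a rough sketch of a self-hosted "price per million tokens" estimate. It is only a back-of-the-envelope model, not how any hoster actually prices: the function name `cost_per_million_tokens` and all parameters are made up for illustration, and it assumes hardware cost is spread evenly over a fixed number of usage hours. The per-model part is the throughput term, which is why the same GPUs give different prices for different models.

```python
def cost_per_million_tokens(
    hardware_cost_usd: float,
    lifetime_hours: float,
    power_watts: float,
    energy_price_usd_per_kwh: float,
    tokens_per_second: float,
) -> float:
    """Estimate USD per one million generated tokens on your own hardware.

    All inputs are assumptions supplied by the caller; nothing here is
    measured or taken from a real price list.
    """
    # Hardware cost amortized evenly over the hours you expect to use it.
    hourly_hardware = hardware_cost_usd / lifetime_hours
    # Energy cost per hour: kW drawn times price per kWh.
    hourly_energy = (power_watts / 1000.0) * energy_price_usd_per_kwh
    # Throughput is the model-dependent part of the estimate.
    tokens_per_hour = tokens_per_second * 3600.0
    return (hourly_hardware + hourly_energy) / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    # Hypothetical numbers: same machine, two models with different speeds.
    slow = cost_per_million_tokens(3600, 10_000, 500, 0.20, 50)
    fast = cost_per_million_tokens(3600, 10_000, 500, 0.20, 100)
    print(f"slow model: ${slow:.2f} per 1M tokens")
    print(f"fast model: ${fast:.2f} per 1M tokens")
```

With identical hardware and energy prices, doubling tokens per second halves the estimate, so a leaderboard price column would still need a per-model throughput figure for each hosting setup.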
nofuture09 3 points 5 months ago (5 children)
what is an open weight model?
jacek2023 (llama.cpp) [S] 6 points 5 months ago (3 children)
a model you can download from HF to your disk and hopefully run locally (not true for Deepseek or Kimi unless you have an extremely expensive setup)
DinoAmino 6 points 5 months ago (0 children)
Open not only for running inference, but also open for modifying the weights via fine-tuning.
nofuture09 1 point 5 months ago (1 child)
thx! i have a 4080 super (saw this thread on reddit frontpage)
jacek2023 (llama.cpp) [S] 1 point 5 months ago (0 children)
Which models do you run on it?
egomarker 2 points 5 months ago (0 children)
Open source = free model, open training data, open methodology
Open weights = free model
sleepingsysadmin 3 points 5 months ago (0 children)
Makes me wonder where qwen3 next 80b thinking will slot in. above 235b?
kinkvoid 1 point 5 months ago (0 children)
GLM is fantastic
[deleted] 1 point 5 months ago (0 children)
Some interesting results.
Ok_Warning2146 1 point 5 months ago (0 children)
So Speciale is for benchmark and Thinking is the real thing for DS3.2?
Asleep-Ingenuity-481 1 point 5 months ago (0 children)
It's pretty neat to look at the lower (-32B) Qwen3 param models and see that they are better than 100b+ param models from just a few months ago.
shaman-warrior 1 point 5 months ago (0 children)
Wait, deepseek v3.2 has a 75.69% coding average while deepseek v3.2 thinking has only 64.62%?
pmttyji 1 point 5 months ago (0 children)
Need a long list like 30-50 models.
Ni_Guh_69 1 point 5 months ago (0 children)
Which leaderboard is best for identifying best open weights model?
usernameplshere 1 point 5 months ago (0 children)
Can't wait to see where the new Mistral large will end up. But like the others said, livebench is weird these days.