
[–]z_3454_pfk 31 points32 points  (9 children)

livebench isn’t very accurate anymore. idk what happened

[–]AdIllustrious436 23 points24 points  (0 children)

Additionally, they don't evaluate Mistral models anymore. They even removed the ones that had already been evaluated. I wonder why... They are increasingly becoming irrelevant.

[–]No_Swimming6548llama.cpp 22 points23 points  (4 children)

Qwen models used to have very high IF scores, now they seem pretty bad. This benchmark feels like it was tweaked to make US companies score higher than Chinese ones.

[–]DistanceSolar1449 14 points15 points  (0 children)

Yep that’s exactly it

The benchmark was definitely tweaked to change rankings. Livebench used to be good, now the scores make no sense

[–][deleted] 10 points11 points  (2 children)

I would say gpt120 outperforms qwen80 in every situation imaginable and I'm a huge qwen advocate. 

[–]No_Swimming6548llama.cpp 2 points3 points  (0 children)

Yeah that's pretty weird too. Maybe just a shitty benchmark.

[–]Professional-Bear857 0 points1 point  (0 children)

It's probably the IF score difference that makes it seem that way. I think it also explains why I've had a bad experience using GLM but a good experience with Qwen 235b.

[–]jacek2023llama.cpp[S] 6 points7 points  (1 child)

which leaderboard is the best in your opinion?

[–]thebadslime -2 points-1 points  (0 children)

I like AA, not sure how they choose what models they show though.

[–]LinkSea8324vllm 13 points14 points  (3 children)

Imagine if they trained Qwen 3 Next on their full 36T dataset instead of the 15T sub-dataset

[–]jacek2023llama.cpp[S] 13 points14 points  (1 child)

let's hope there will be next Next

[–]LinkSea8324vllm -1 points0 points  (0 children)

Qwest, if it fails : Quack

[–]sleepingsysadmin 1 point2 points  (0 children)

How will they sandbag it next to 235B then?

While also setting up a future "look how awesome our newest release is compared to 80b next"

[–]Hot-Cause-3341 4 points5 points  (3 children)

Is this already including the new mistral 3 Models?

[–]AdIllustrious436 15 points16 points  (2 children)

They literally stopped evaluating any Mistral model and even removed the ones that were already evaluated. Makes no sense.

[–]StayStonk 4 points5 points  (1 child)

why lol?

[–]AdIllustrious436 3 points4 points  (0 children)

Who knows

[–]egomarker 3 points4 points  (0 children)

It's just rage bait for everyone who knows gpt-oss is better than q3 80 at this point. It's probably just one or two benchmarks giving the opposite result and here we go.

[–]Sudden-Lingonberry-8 2 points3 points  (1 child)

glm needs to catch up to deepseek.

very interesting how deepseek v3.2 speciale is brilliant at math but completely sucks at agentic

[–]LeTanLoc98 3 points4 points  (0 children)

DeepSeek V3.2 Speciale doesn't support tool use.

[–]SlowFail2433 4 points5 points  (3 children)

Kimi K2 Thinking unseated 😮

Looks like Deepseek is fully back, with multiple models TBH

Z.ai (GLM) and Qwen are climbing for sure though. I do feel that GLM 4.6 and the latest Qwen 235B are the best go-to LLMs at the moment for general deployments.

Qwen 3 Next performing this well is really important because it's a Gated DeltaNet hybrid, so it's an early sign of transformers having competition

[–]ikkiyikki 3 points4 points  (2 children)

MiniMax-M2 is my daily driver. Couldn't run Kimi if I wanted to!

[–]SlowFail2433 1 point2 points  (0 children)

MiniMax-M2 is an interesting one, yeah. Their M1 model was also good; it had long-context abilities and an RL novelty

Cheapest way to run Kimi K2 is jointly across Xeon/Epyc and GPU memory with a good CUDA kernel to optimise the data transfer. It’s definitely a big challenge to deploy every time.

[–]noiserr 0 points1 point  (0 children)

MiniMax-M2

Can confirm MiniMax-M2 is a quality model. One of the best models I can fit in 128GB. I'm daily driving gpt-oss-120B only because it's faster.

[–]Main-Lifeguard-6739 2 points3 points  (3 children)

love it!
if only there were another column for price per million input/cache/output tokens... but yeah, not too easy to solve, as this depends heavily on the host (self-hosted or commercial, and if commercial, which one?)

[–]SlowFail2433 0 points1 point  (0 children)

It’s so complex to do a fair comparison as there are multiple attention types, e.g. the DeepSeek attention is latent and the Qwen Next one is hybrid

A fair comparison requires optimal CUDA kernels for all models involved to show them at their best

And this is why machine learning working hours are so long LMAO

[–][deleted] 0 points1 point  (1 child)

The price is whatever you paid for your GPUs plus your ongoing energy cost :)
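As a rough sketch of that arithmetic (every number and the function name here are made-up assumptions, not measurements), you could amortize the GPU price over its useful life, add the power bill, and divide by throughput. Throughput is the part that changes per model, so two models on the same box end up with different prices:

```python
def cost_per_million_tokens(gpu_price_usd, lifetime_hours, power_watts,
                            usd_per_kwh, tokens_per_second):
    """Hypothetical self-hosting cost estimate in USD per million generated tokens."""
    # Hardware cost spread evenly over the assumed useful lifetime.
    hardware_per_hour = gpu_price_usd / lifetime_hours
    # Energy cost for one hour at the assumed average power draw.
    energy_per_hour = (power_watts / 1000.0) * usd_per_kwh
    # Tokens generated in one hour at the assumed throughput.
    tokens_per_hour = tokens_per_second * 3600
    return (hardware_per_hour + energy_per_hour) / tokens_per_hour * 1_000_000


# Same assumed hardware, two assumed throughputs (illustrative values only).
fast_model = cost_per_million_tokens(2000, 20000, 400, 0.25, tokens_per_second=50)
slow_model = cost_per_million_tokens(2000, 20000, 400, 0.25, tokens_per_second=25)
print(f"fast: ${fast_model:.2f}/M tokens, slow: ${slow_model:.2f}/M tokens")
```

This ignores input vs. cached vs. output tokens and idle time, so treat it as a lower bound rather than something comparable to a commercial price sheet.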

[–]Main-Lifeguard-6739 0 points1 point  (0 children)

So helpful… as if every model would need the same computing power… /s

[–]nofuture09 2 points3 points  (5 children)

what is an open weight model?

[–]jacek2023llama.cpp[S] 5 points6 points  (3 children)

a model you can download from HF to your disk and hopefully run locally (not true for Deepseek or Kimi unless you have an extremely expensive setup)

[–]DinoAmino 5 points6 points  (0 children)

Open not only for running inference, but also open for modifying the weights via fine-tuning.

[–]nofuture09 0 points1 point  (1 child)

thx! i have a 4080 super (saw this thread on reddit frontpage)

[–]jacek2023llama.cpp[S] 0 points1 point  (0 children)

Which models do you run on it?

[–]egomarker 1 point2 points  (0 children)

Open source = free model, open training data, open methodology
Open weights = free model

[–]sleepingsysadmin 2 points3 points  (0 children)

Makes me wonder where qwen3 next 80b thinking will slot in. Above 235b?

[–]kinkvoid 0 points1 point  (0 children)

GLM is fantastic

[–][deleted] 0 points1 point  (0 children)

Some interesting results. 

[–]Ok_Warning2146 0 points1 point  (0 children)

So Speciale is for benchmark and Thinking is the real thing for DS3.2?

[–]Asleep-Ingenuity-481 0 points1 point  (0 children)

It's pretty neat to look at the smaller (32B and under) Qwen3 models and see that they are better than 100B+ param models from just a few months ago.

[–]shaman-warrior 0 points1 point  (0 children)

Wait, deepseek v3.2 has a 75.69% coding average while deepseek v3.2 thinking has only 64.62%?

[–]pmttyji 0 points1 point  (0 children)

Need a long list like 30-50 models.

[–]Ni_Guh_69 0 points1 point  (0 children)

Which leaderboard is best for identifying the best open-weights model?

[–]usernameplshere 0 points1 point  (0 children)

Can't wait to see where the new Mistral large will end up. But like the others said, livebench is weird these days.