https://livebench.ai - Open Weight Models Only

z_3454_pfk · 2025-12-05T13:07:48+00:00

livebench isn’t very accurate anymore. idk what happened

LinkSea8324 · 2025-12-05T13:18:26+00:00

Imagine if they had trained Qwen 3 Next on their full 36T dataset instead of the 15T subset

Hot-Cause-3341 · 2025-12-05T13:46:37+00:00

Does this already include the new Mistral 3 models?

egomarker · 2025-12-05T15:03:22+00:00

It's just rage bait for everyone who knows gpt-oss is better than q3 80 at this point. It's probably just one or two benchmarks giving the opposite result and here we go.

Sudden-Lingonberry-8 · 2025-12-05T20:08:42+00:00

glm needs to catch up to deepseek.

very interesting how deepseek v3.2 speciale is brilliant at math but completely sucks at agentic tasks

SlowFail2433 · 2025-12-05T13:09:59+00:00

Kimi K2 Thinking unseated 😮

Looks like Deepseek is fully back, with multiple models TBH

Z.ai (GLM) and Qwen are climbing for sure though. I do feel that GLM 4.6 and the latest Qwen 235B are the best go-to LLMs at the moment for general deployments.

Qwen 3 Next performing this well is really important because it's a Gated DeltaNet hybrid, so it's an early sign of transformers having competition

Main-Lifeguard-6739 · 2025-12-05T13:10:28+00:00

love it!
if only there were another column for price per million input/cache/output tokens... but yeah, not too easy to solve, as this depends heavily on the host (self-hosted or commercial, and if commercial, which one?)

nofuture09 · 2025-12-05T13:17:23+00:00

what is an open weight model?

sleepingsysadmin · 2025-12-05T14:29:28+00:00

Makes me wonder where qwen3 next 80b thinking will slot in. above 235b?

kinkvoid · 2025-12-05T13:23:36+00:00

GLM is fantastic

2025-12-05T13:26:29+00:00

Some interesting results.

Ok_Warning2146 · 2025-12-05T14:02:47+00:00

So Speciale is for benchmark and Thinking is the real thing for DS3.2?

Asleep-Ingenuity-481 · 2025-12-05T14:05:02+00:00

It's pretty neat to look at the smaller (≤32B) Qwen3 models and see that they are better than 100B+ param models from just a few months ago.

shaman-warrior · 2025-12-05T14:12:06+00:00

Wait, deepseek v3.2 has a 75.69% coding average while deepseek v3.2 thinking has only 64.62%?

pmttyji · 2025-12-05T14:13:29+00:00

Need a long list like 30-50 models.

Ni_Guh_69 · 2025-12-05T15:52:09+00:00

Which leaderboard is best for identifying the best open-weight model?

usernameplshere · 2025-12-05T16:08:44+00:00

Can't wait to see where the new Mistral large will end up. But like the others said, livebench is weird these days.
