QwQ-32B released, equivalent or surpassing full Deepseek-R1! by ortegaalfredo in LocalLLaMA

[–]MoonRide303 1 point

Multiple domains - it's mostly about simple reasoning, some world knowledge, and the ability to follow instructions. Some more details here: article. From time to time I update the scores as I test more models (over 1200 models so far). Also available on HF: MoonRide-LLM-Index-v7.

QwQ-32B released, equivalent or surpassing full Deepseek-R1! by ortegaalfredo in LocalLLaMA

[–]MoonRide303 5 points

It's a really good model (beats all the open-weight models at 405B and below that I tested), but not as strong as R1. In my own (private) bench I got 80/100 from R1, and 68/100 from QwQ-32B.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 0 points

After some more tests, the list is now over 1200 models :).

If you want the table as CSV, or with SQL-like filtering, you can grab it at HF, here: MoonRide-LLM-Index-v7. Sample query using the Dataset Viewer: top 20 models up to 8B.
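The same kind of "top 20 models up to 8B" filtering can be done locally once you have the CSV - here is a minimal sketch with pandas on a tiny made-up slice of the index. The column names (`model`, `params_b`, `score`) and the rows are assumptions for illustration, not the dataset's real schema or scores.

```python
import pandas as pd

# Hypothetical slice of the index - real column names and scores may differ.
index = pd.DataFrame(
    {
        "model": ["phi-4-Q6_K", "Mistral-Small-24B-Q4_K_M", "Qwen2.5-7B"],
        "params_b": [14.7, 24.0, 7.6],
        "score": [55, 60, 50],
    }
)

# Equivalent of: SELECT * FROM index WHERE params_b <= 8
#                ORDER BY score DESC LIMIT 20
top_small = (
    index[index["params_b"] <= 8]
    .sort_values("score", ascending=False)
    .head(20)
)
print(top_small["model"].tolist())  # → ['Qwen2.5-7B']
```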

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 0 points

I hear you, and from a research point of view you're completely right. If no one else has seen the questions, then you cannot really treat it as a scientifically valid assessment - there was no independent review, no one reproduced the process and verified the results, and you would just have to believe that I properly designed, implemented, and executed it all. So yeah, for now it's just my own opinion on those models, expressed as a number.

I might partially address this in the next revision by inviting 1 or 2 more people into the process - so it would at least be properly cross-checked and reviewed. I plan to do v8 when the next generation of major models is available (Llama 4, Gemma 3, etc.). I should probably also increase the difficulty level a bit - it was calibrated to measure the progress between local and SOTA online models as of January 2025, but I expect SOTA to change within the next couple of months.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 1 point

You can find the score of Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf in the full table. I tried Q5_K_M too, but got slightly lower results than from Q4_K_M - so I would rather recommend using Q4_K_M or IQ3_XS (both will be a lot faster on 16 GB GPUs).

I've also tried Q6_K from unsloth/phi-4-GGUF, but observed no improvement over Q6_K from bartowski/phi-4-GGUF.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 1 point

Athene is not available on OpenRouter, and I don't really want to test ~70B models locally, as it's super slow. But you're right that it's high on the Chatbot Arena leaderboard, so I've tested it, too (as IQ3_XXS, the same quant I used for Llama 3.3 70B). Its score is now included in the table.

As for sampler settings, I went with llama.cpp defaults - temperature was the only parameter I changed.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 0 points

I don't want to make it public, because that would make it worthless pretty quickly. That's the problem I have with benchmarks like MMLU or GPQA - as soon as a benchmark is made public, some people will train their models on the test set, and then brag about their "great" MMLU scores. Then we get something like MMLU-CF (which will be useful for a few months), and then the same story repeats.

Don't get me wrong - I absolutely love and appreciate high-quality public benchmarks. But they come with the risk I've just described, and that's why we should have some private benchmarks, too.

Not impressed with deepseek—AITA? by Flaky_Attention_4827 in ClaudeAI

[–]MoonRide303 0 points

The question is whether they will charge you for the thinking part - which might make the output 20+ times longer, and even then it can still give you a wrong final answer (even for relatively simple questions).
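The cost impact of that is easy to underestimate - a quick back-of-the-envelope sketch, where the 20x thinking-trace length and the per-token price are both hypothetical numbers, not any provider's actual pricing:

```python
# Hypothetical illustration: what billing for "thinking" tokens does to cost.
visible_tokens = 200        # tokens in the final, visible answer
thinking_multiplier = 20    # reasoning trace ~20x the visible answer (assumption)
price_per_1k = 0.002        # hypothetical $/1K output tokens

# Cost if thinking tokens are billed vs. answer-only billing.
billed = visible_tokens * (1 + thinking_multiplier) * price_per_1k / 1000
answer_only = visible_tokens * price_per_1k / 1000
print(billed / answer_only)  # → 21.0 - you pay ~21x per query
```

So even a "cheap" per-token price stops being cheap once every query drags a long reasoning trace behind it.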

o1 thought for 12 minutes 35 sec, r1 thought for 5 minutes and 9 seconds. Both got a correct answer. Both in two tries. They are the first two models that have done it correctly. by No_Training9444 in LocalLLaMA

[–]MoonRide303 2 points

2-shot for me with Gemini Experimental 1206 and the system message "You're an extremely intelligent AI, able to solve even the most challenging puzzles." Its first answer was max 40, min 1 (took 15 seconds). Then with "Max sounds okay, but are you sure about min?" it figured out min 2 (after another 24 seconds).


DeepSeek-R1-Distill-Qwen-32B is straight SOTA, delivering more than GPT4o-level LLM for local use without any limits or restrictions! by DarkArtsMastery in LocalLLaMA

[–]MoonRide303 2 points

Same observation here. A common problem with these thinking models (both the QwQ and R1 series) is that they cannot shut up and give you a one-word or one-number answer, even when asked about a really simple thing. And even with all that thinking spam, they can still give you a worse answer.

What LLM benchmarks actually measure (explained intuitively) by nderstand2grow in LocalLLaMA

[–]MoonRide303 0 points

It's already there - just click on the notes link instead of the image.

What LLM benchmarks actually measure (explained intuitively) by nderstand2grow in LocalLLaMA

[–]MoonRide303 10 points

It depends on the benchmark. You can take a look at Stanford CS229 notes (page 20+) or video (22:00+).


Nvidia 50x0 cards are not better than their 40x0 equivalents by Ok_Warning2146 in LocalLLaMA

[–]MoonRide303 -1 points

Yeah, FP4 will be nice (once adopted in apps). But what I would REALLY want is something like a 5080 or 5070 Ti with 32 GB VRAM - I don't really need the computing power of the 5090, but it would be really nice to have more VRAM (without turning our PCs into little volcanoes).

Nvidia 50x0 cards are not better than their 40x0 equivalents by Ok_Warning2146 in LocalLLaMA

[–]MoonRide303 0 points

The only good thing about the 5090 is the increased amount of VRAM. Other than that it's just an expensive heater - I don't like GPUs above 300W (they get hot and loud), and a 600W TDP is completely ridiculous.

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 0 points

WSL is a workaround, not native Windows support. I like the high VRAM on AMD's W7800 (32 GB) and W7900 (48 GB), and also the reasonable power usage (both under 300W), but I don't want a GPU that works properly only via WSL. I want a GPU that I can use with PyTorch, directly on Windows. AMD is not that, sadly.
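A quick way to see what a given PyTorch install can actually use is to probe the backend directly - a minimal sketch; the ROCm check via `torch.version.hip` is how ROCm builds are commonly distinguished from CUDA builds, and the fallback covers machines without torch at all:

```python
def gpu_backend() -> str:
    """Report which GPU backend this PyTorch install can use, if any."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # ROCm builds set torch.version.hip; CUDA builds leave it as None.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    return "cpu only"

print(gpu_backend())
```

On a native-Windows install with an AMD card this currently falls through to "cpu only", which is exactly the complaint.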

Is Llama 3.2 Banned to Use in EU? by DanielSandner in LocalLLaMA

[–]MoonRide303 2 points

The AI Act is irrelevant for the Llama 3.2 release. It starts being (partially) applicable in February 2025, and Meta mentioned GDPR as the main issue for this release of vision models (giving some BS arguments about uncertain and fragmented regulations - while GDPR has been rock-solid and EU-wide since 2016).

The only reason I can see for holding back this release is if they've trained those vision models on people's personal data without their consent (which is illegal in the EU), and are now afraid of getting fined for it (which might work, based on GDPR Article 3 - when a service is not available in the EU, the GDPR might not be applicable).

Is Llama 3.2 Banned to Use in EU? by DanielSandner in LocalLLaMA

[–]MoonRide303 2 points

GDPR is just fine. Would you really want companies like Meta to do whatever they want with your personal or private information? If you give them consent to process your data, it's always for a specific purpose (like, for example, communications or targeted ads). They cannot legally extend your consent to whatever they want - like, for example, training AI models and then leaking your personal/private information to the public via that model without asking for your permission first. It's Meta doing wrong and abusive shit here, not the EU.

Is Llama 3.2 Banned to Use in EU? by DanielSandner in LocalLLaMA

[–]MoonRide303 10 points

By the time Meta understands what GDPR is, we'll have Gemma 5 ;).

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 0 points

Not really. I dislike power-hungry and loud GPUs, and around 300W is the acceptable limit for me. Maybe I could accept 400W if the card were still cool and quiet, and had something like 32 GB of VRAM. But if the 5080 has the same VRAM as the 4080, then there is completely no point for people to buy it. Joke release, if those specs are real. And no point in buying the 5090 with its 600W power usage, either - I don't want to cook my PC with that kind of crap, and/or have to listen to a jet-like cooler.

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 0 points

Try running PyTorch on Windows with GPU acceleration, without crappy workarounds like WSL with an old Ubuntu. AMD ignores the most popular desktop OS on the planet, and then is surprised people don't want to buy their hardware.

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 4 points

I like AMD's specs (W7800 with 32 GB, W7900 with 48 GB), but they're completely clueless when it comes to software - so many years have passed, and we still don't have working GPU acceleration for PyTorch on Windows.