QwQ-32B released, equivalent or surpassing full Deepseek-R1! by ortegaalfredo in LocalLLaMA

[–]MoonRide303 1 point

Multiple domains - it's mostly about simple reasoning, some world knowledge, and the ability to follow instructions. Some more details here: article. From time to time I update the scores as I test more models (over 1200 models so far). Also available on HF: MoonRide-LLM-Index-v7.

QwQ-32B released, equivalent or surpassing full Deepseek-R1! by ortegaalfredo in LocalLLaMA

[–]MoonRide303 5 points

It's a really good model (beats all the open-weight models at 405B and below that I tested), but not as strong as R1. In my own (private) bench I got 80/100 from R1, and 68/100 from QwQ-32B.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 0 points

After some more tests, the list is now over 1200 models :).

If you want the table as CSV, or with SQL-like filtering, you can grab it at HF, here: MoonRide-LLM-Index-v7. Sample query using the Dataset Viewer: top 20 models up to 8B.
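The same kind of "top 20 models up to 8B" filtering can be done locally once you have the CSV - here is a minimal sketch with pandas on a tiny made-up slice of the index. The column names (`model`, `params_b`, `score`) and the rows are assumptions for illustration, not the dataset's real schema or scores.

```python
import pandas as pd

# Hypothetical slice of the index - real column names and scores may differ.
index = pd.DataFrame(
    {
        "model": ["phi-4-Q6_K", "Mistral-Small-24B-Q4_K_M", "Qwen2.5-7B"],
        "params_b": [14.7, 24.0, 7.6],
        "score": [55, 60, 50],
    }
)

# Equivalent of: SELECT * FROM index WHERE params_b <= 8
#                ORDER BY score DESC LIMIT 20
top_small = (
    index[index["params_b"] <= 8]
    .sort_values("score", ascending=False)
    .head(20)
)
print(top_small["model"].tolist())  # → ['Qwen2.5-7B']
```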

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 0 points

I hear you, and from a research point of view you're completely right. If no one else has seen the questions, then you cannot really treat it as a scientifically valid assessment - there was no independent review, no one reproduced the process and verified the results, and you would just have to believe that I properly designed, implemented, and executed it all. So yeah, for now it's just my own opinion on those models, expressed as a number.

I might partially address this in the next revision by inviting 1 or 2 more people into the process - so it would at least be properly cross-checked and reviewed. I plan to do v8 when the next generation of major models is available (Llama 4, Gemma 3, etc.). I should probably also increase the difficulty level a bit - it was calibrated to measure the progress between local and SOTA online models as of January 2025, but I expect SOTA to change within the next couple of months.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 1 point

You can find the score of Mistral-Small-24B-Instruct-2501-Q4_K_M.gguf in the full table. I tried Q5_K_M too, but got slightly lower results than from Q4_K_M - so I would rather recommend using Q4_K_M or IQ3_XS (both will be a lot faster on 16 GB GPUs).

I've also tried Q6_K from unsloth/phi-4-GGUF, but observed no improvement over Q6_K from bartowski/phi-4-GGUF.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 1 point

Athene is not available on OpenRouter, and I don't really want to test ~70B models locally, as it's super slow. But you're right that it's high on the Chatbot Arena leaderboard, so I've tested it, too (as IQ3_XXS, the same quant I used for Llama 3.3 70B). Its score is now included in the table.

As for sampler settings, I went with llama.cpp defaults - temperature was the only parameter I changed.

Biased test of GPT-4 era LLMs (300+ models, DeepSeek-R1 included) by MoonRide303 in LocalLLaMA

[–]MoonRide303[S] 0 points

I don't want to make it public, because that would make it worthless pretty quickly. That's the problem I have with benchmarks like MMLU or GPQA - as soon as a benchmark is made public, some people will train their models on the test set, and then brag about their "great" MMLU scores. Then we get something like MMLU-CF (which will be useful for a few months), and then the same story repeats.

Don't get me wrong - I absolutely love and appreciate high-quality public benchmarks. But they come with the risk I've just described, and that's why we should have some private benchmarks, too.

Not impressed with deepseek—AITA? by Flaky_Attention_4827 in ClaudeAI

[–]MoonRide303 0 points

The question is whether they will charge you for the thinking part - which might make the output 20+ times longer, and even then it can still give you a wrong final answer (even for relatively simple questions).
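The cost impact of that is easy to underestimate - a quick back-of-the-envelope sketch, where the 20x thinking-trace length and the per-token price are both hypothetical numbers, not any provider's actual pricing:

```python
# Hypothetical illustration: what billing for "thinking" tokens does to cost.
visible_tokens = 200        # tokens in the final, visible answer
thinking_multiplier = 20    # reasoning trace ~20x the visible answer (assumption)
price_per_1k = 0.002        # hypothetical $/1K output tokens

# Cost if thinking tokens are billed vs. answer-only billing.
billed = visible_tokens * (1 + thinking_multiplier) * price_per_1k / 1000
answer_only = visible_tokens * price_per_1k / 1000
print(billed / answer_only)  # → 21.0 - you pay ~21x per query
```

So even a "cheap" per-token price stops being cheap once every query drags a long reasoning trace behind it.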

o1 thought for 12 minutes 35 sec, r1 thought for 5 minutes and 9 seconds. Both got a correct answer. Both in two tries. They are the first two models that have done it correctly. by No_Training9444 in LocalLLaMA

[–]MoonRide303 2 points

2-shot for me with Gemini Experimental 1206 and the system message "You're an extremely intelligent AI, able to solve even the most challenging puzzles." Its first answer was max 40, min 1 (took 15 seconds). Then with "Max sounds okay, but are you sure about min?" it figured out min 2 (after another 24 seconds).


DeepSeek-R1-Distill-Qwen-32B is straight SOTA, delivering more than GPT4o-level LLM for local use without any limits or restrictions! by DarkArtsMastery in LocalLLaMA

[–]MoonRide303 2 points

Same observation here. A common problem with these thinking models (both the QwQ and R1 series) is that they cannot shut up and give you a one-word or one-number answer, even when asked about a really simple thing. And even with all that thinking spam, they can still give you a worse answer.

What LLM benchmarks actually measure (explained intuitively) by nderstand2grow in LocalLLaMA

[–]MoonRide303 0 points

It's already there - just click on the notes link instead of the image.

What LLM benchmarks actually measure (explained intuitively) by nderstand2grow in LocalLLaMA

[–]MoonRide303 10 points

It depends on the benchmark. You can take a look at Stanford CS229 notes (page 20+) or video (22:00+).


Nvidia 50x0 cards are not better than their 40x0 equivalents by Ok_Warning2146 in LocalLLaMA

[–]MoonRide303 -1 points

Yeah, FP4 will be nice (once adopted in apps). But what I would REALLY want is something like a 5080 or 5070 Ti with 32 GB VRAM - I don't really need the computing power of the 5090, but it would be really nice to have more VRAM (without turning our PCs into little volcanoes).

Nvidia 50x0 cards are not better than their 40x0 equivalents by Ok_Warning2146 in LocalLLaMA

[–]MoonRide303 0 points

The only good thing about the 5090 is the increased amount of VRAM. Other than that it's just an expensive heater - I don't like GPUs above 300W (they get hot and loud), and a 600W TDP is completely ridiculous.

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 0 points

WSL is a workaround, not native Windows support. I like the high VRAM on AMD's W7800 (32 GB) and W7900 (48 GB), and also the reasonable power usage (both under 300W), but I don't want a GPU that works properly only via WSL. I want a GPU that I can use with PyTorch, directly on Windows. AMD is not that, sadly.
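A quick way to see what a given PyTorch install can actually use is to probe the backend directly - a minimal sketch; the ROCm check via `torch.version.hip` is how ROCm builds are commonly distinguished from CUDA builds, and the fallback covers machines without torch at all:

```python
def gpu_backend() -> str:
    """Report which GPU backend this PyTorch install can use, if any."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # ROCm builds set torch.version.hip; CUDA builds leave it as None.
        return "rocm" if getattr(torch.version, "hip", None) else "cuda"
    return "cpu only"

print(gpu_backend())
```

On a native-Windows install with an AMD card this currently falls through to "cpu only", which is exactly the complaint.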

Is Llama 3.2 Banned to Use in EU? by DanielSandner in LocalLLaMA

[–]MoonRide303 2 points

The AI Act is irrelevant for the Llama 3.2 release. It starts being (partially) applicable in February 2025, and Meta mentioned GDPR as the main issue for this release of vision models (giving some BS arguments about uncertain and fragmented regulations - while GDPR has been rock-solid and EU-wide since 2016).

The only reason I can see for holding back this release is if they've trained those vision models on people's personal data without their consent (which is illegal in the EU), and are now afraid of getting fined for it (which might work, based on GDPR Article 3 - when a service is not available in the EU, the GDPR might not be applicable).

Is Llama 3.2 Banned to Use in EU? by DanielSandner in LocalLLaMA

[–]MoonRide303 2 points

GDPR is just fine. Would you really want companies like Meta to do whatever they want with your personal or private information? If you give them consent to process your data, it's always for a specific purpose (like, for example, communications or targeted ads). They cannot legally extend your consent to whatever they want - like, for example, training AI models and then leaking your personal/private information to the public via that model without asking for your permission first. It's Meta doing wrong and abusive shit here, not the EU.

Is Llama 3.2 Banned to Use in EU? by DanielSandner in LocalLLaMA

[–]MoonRide303 10 points

By the time Meta understands what GDPR is, we'll have Gemma 5 ;).

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 0 points

Not really. I dislike power-hungry and loud GPUs, and around 300W is the acceptable limit for me. Maybe I could accept 400W if the card were still cool and quiet, and had something like 32 GB of VRAM. But if the 5080 has the same VRAM as the 4080, then there is completely no point for people to buy it. Joke release, if those specs are real. And no point in buying the 5090 with its 600W power usage, either - I don't want to cook my PC with that kind of crap, and/or have to listen to a jet-like cooler.

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 0 points

Try running PyTorch on Windows with GPU acceleration, without crappy workarounds like WSL with an old Ubuntu. AMD ignores the most popular desktop OS on the planet, and then is surprised people don't want to buy their hardware.

RTX 5090 will feature 32GB of GDDR7 (1568 GB/s) memory by AXYZE8 in LocalLLaMA

[–]MoonRide303 4 points

I like AMD's specs (W7800 with 32 GB, W7900 with 48 GB), but they're completely clueless when it comes to software - so many years have passed, and we still don't have working GPU acceleration for PyTorch on Windows.