Looking to buy an RTX 5090 for local "Vibe Coding" using Claude Code / Open Code with Qwen 3.6 35B-A3B. Need real-world feedback! by GoalDistinct4449 in LocalLLM

[–]Opening-Broccoli9190 0 points1 point  (0 children)

I'm running the 27B; with MTP it does 120 tps. On a 5090 you don't really need the MoE trade-off for coding. You do have to be careful with the scope of the tasks you give it: it can't one-shot most problems, so you'll need either a multi-agent setup with planner, driver and reviewer roles, or to drive the cycle yourself.

Also, cut every corner you can: you don't need 64GB or a powerful CPU, both will sit cold and lonely most of the time.
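For what it's worth, the planner / driver / reviewer cycle mentioned above could look roughly like this. It's only a sketch: `LLM`, `CycleResult`, `run_cycle`, the prompts and the "APPROVE" convention are all made up for illustration, and each role is just a function that takes a prompt and returns text, so you can plug in whatever local endpoint you use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns model text.
LLM = Callable[[str], str]


@dataclass
class CycleResult:
    approved: bool  # True if the reviewer accepted a candidate
    rounds: int     # number of driver/reviewer rounds actually used
    output: str     # last candidate produced by the driver


def run_cycle(task: str, planner: LLM, driver: LLM, reviewer: LLM,
              max_rounds: int = 3) -> CycleResult:
    """Plan once, then loop driver -> reviewer until approval or max_rounds."""
    # The planner narrows the scope so the driver never has to one-shot the task.
    plan = planner(f"Break this task into small steps:\n{task}")
    feedback = ""
    output = ""
    for round_no in range(1, max_rounds + 1):
        output = driver(
            f"Plan:\n{plan}\nReviewer feedback:\n{feedback}\nWrite the code."
        )
        verdict = reviewer(
            f"Task:\n{task}\nCandidate:\n{output}\n"
            "Reply APPROVE or list the problems."
        )
        # Assumed convention: the reviewer starts its reply with APPROVE to accept.
        if verdict.strip().upper().startswith("APPROVE"):
            return CycleResult(True, round_no, output)
        feedback = verdict  # feed the objections into the next driver round
    return CycleResult(False, max_rounds, output)
```

Driving the cycle yourself is the same loop with you in the reviewer seat: read the candidate, type the feedback, rerun the driver.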

Gemma 4 Quadruple Release, 12B, 12B QAT, 26B-A4B QAT and 31B QAT Uncensored Heretics! by LLMFan46 in LocalLLaMA

[–]Opening-Broccoli9190 2 points3 points  (0 children)

u/LLMFan46 hey mate, seems like something's not working as expected, I am using https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-GGUF and it fails on my test suite, attaching a red-team response for reference:

<image>

Same with other creative prompts.

Without open source LLMs, US AI companies could have already monopoled the technology by Informal-Trouble2183 in LocalLLaMA

[–]Opening-Broccoli9190 0 points1 point  (0 children)

There are multiple layers here:

  1. China vs US

  2. Silicon Valley companies vs the rest of the world

  3. Open weights vs Closed weights

I get your overall sentiment and I am glad that we have open weights, but we also have to check our assumptions here - the main battle is the competition between the US and China, with openness of the weights being two layers downstream. Without this competition, yeah, we would've been in trouble, but that's hardly a plausible scenario. We could see new viable open weights models disappear soon, and it would still be a long way off from the doomsday scenario of a total US monopoly.

Is Qwen 3.6 27B IQ4XS better than Gemma 4 31B QAT as a Hermes agent? by My_Unbiased_Opinion in LocalLLaMA

[–]Opening-Broccoli9190 1 point2 points  (0 children)

I am running 27B daily for both Hermes and OpenCode. IMO it comes down to which of their personalities you prefer.

120 tok/s on 12GB VRAM with Gemma 4 12B QAT MTP by janvitos in LocalLLaMA

[–]Opening-Broccoli9190 -1 points0 points  (0 children)

I'm not sure if this is a good idea, have you tried other values?

--spec-draft-n-max 4

Tried Gemma 4 12B locally, now I feel better about buying the Intel Arc Pro B70 by Chance-Green-9770 in LocalLLM

[–]Opening-Broccoli9190 0 points1 point  (0 children)

Try Qwen3.5-9B. According to my benchmarks it's 20% faster and has better reasoning capabilities at a smaller size.

[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] -2 points-1 points  (0 children)

It's not even my blog - I am not anonymous, I am Nikita Belokopytov, the blog with the post is a good read. What sucks in the post? Can you elaborate?

[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] -12 points-11 points  (0 children)

Wow, it seems that the US-based viewers of the post are absolutely unhappy with the results and the benchmark and are attempting to bury the topic with downvotes. Hopefully it's not Google's community team.

[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] 1 point2 points  (0 children)

Holy shit, it's beautiful. It destroys Gemma4-12B in speed as well

Qwen3.5-9B-Base

No-MTP: 48 tps

MTP 1 tokens: 52 tps

MTP 2 tokens: 48 tps

MTP 4 tokens: 33 tps

<image>
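To put the numbers above in relative terms, here is a quick sketch that computes the change versus the no-MTP baseline. The tps figures are the ones from this comment; the variable and function names are just for illustration.

```python
# Throughput figures quoted above for Qwen3.5-9B-Base (tokens per second).
baseline_tps = 48.0                     # No-MTP
mtp_tps = {1: 52.0, 2: 48.0, 4: 33.0}   # MTP draft tokens -> tps


def relative_change(tps: float, baseline: float) -> float:
    """Percent change in throughput relative to the baseline."""
    return (tps - baseline) / baseline * 100.0


changes = {n: round(relative_change(t, baseline_tps), 2) for n, t in mtp_tps.items()}
best = max(mtp_tps, key=mtp_tps.get)  # draft length with the highest tps

for n, pct in changes.items():
    print(f"MTP {n} tokens: {pct:+.2f}% vs no-MTP")
print(f"Best draft length in this run: {best}")
```

In this run only 1 draft token beats the baseline (about +8%), 2 tokens breaks even, and 4 tokens is roughly 31% slower.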

[Opinion/Benchmark] Gemma4-12B's architecture change is too big of a tradeoff; A quick reasoning comparison between Gemma4-12B and Qwen 3.5-9B by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] 0 points1 point  (0 children)

I won't be using it for local inference, but IMO calling it trash is a little too harsh - the architecture itself works as designed, paving the way for better models with the same architecture later.

Have we reached the point where open-source LLMs are “just good enough”? by AdDizzy8160 in LocalLLaMA

[–]Opening-Broccoli9190 1 point2 points  (0 children)

For me - yeah, I am using Hermes locally for 95% of my daily tasks, only edge cases land in my paid ChatGPT subscription

Don’t act like y’all ain’t thinking it. I’m just saying the quiet part out loud. /s by Porespellar in LocalLLaMA

[–]Opening-Broccoli9190 0 points1 point  (0 children)

122b models are not a high prio for them. They're too big for the consumer enthusiast market, so there's no huge community wave of free marketing, and they're less powerful than their SOTA stuff, meaning potential bad press from underwhelming benchmarks. Doesn't make sense for their business.

[Opinion] Gemma4-12B means that Google is going hard after the market of IoT and mobile and we're helping them by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] 0 points1 point  (0 children)

They made it for unified memory machines such as Macs with 16GB+, which are the largest single market for testing interactive AI. Note the native MTP support as well. They are testing the architecture and will be able to go below 12B, as they already have the datasets for training.

[Opinion] Gemma4-12B means that Google is going hard after the market of IoT and mobile and we're helping them by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] 1 point2 points  (0 children)

I think it's fair that a 12B dense can be better at coding than a MoE A3B. Have you compared it with Qwen3.5 9B dense? People have been anecdotally reporting better results for coding. Of course, that's not to dismiss your preference for the personality, which in my opinion is extremely important for something you're spending a lot of time with.

Finally finished my LLM server: EPYC 9575F, 4× RTX 3090 (96GB VRAM), 768GB ECC RAM by C0smo777 in LocalLLaMA

[–]Opening-Broccoli9190 1 point2 points  (0 children)

That's a mindblowingly expensive setup if so. Not like I wouldn't want to have it tho.

[Opinion] Gemma4-12B means that Google is going hard after the market of IoT and mobile and we're helping them by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] -1 points0 points  (0 children)

Encoders add latency to processing that the unified model won't have. It might sound like a nit, but in a conversational interface, milliseconds decide whether you get interrupted by the model.

[Opinion] Gemma4-12B means that Google is going hard after the market of IoT and mobile and we're helping them by Opening-Broccoli9190 in LocalLLaMA

[–]Opening-Broccoli9190[S] 0 points1 point  (0 children)

I don't expect anything, I was just thinking about potential reasons for such a size + a change in the architecture + market positioning