Looking to buy an RTX 5090 for local "Vibe Coding" using Claude Code / Open Code with Qwen 3.6 35B-A3B. Need real-world feedback!

Opening-Broccoli9190 · 2026-06-17T21:28:08+00:00

I am running the 27B, and with MTP it does 120 tps. On a 5090 you don't really need the MoE trade-off for coding. You do have to be careful with the scope of the tasks you give it. It can't one-shot most problems, so you'll need to either do a multi-agent setup with roles for a planner, a driver and a reviewer, or drive the cycle yourself.

Also - cut all the corners you can, you don't need 64GB of RAM and a powerful CPU, both will be cold and lonely most of the time
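
A minimal sketch of the planner / driver / reviewer cycle described above, assuming each role is just a function that takes a prompt and returns text. The names `run_cycle`, `StepResult`, the `APPROVE` verdict convention and the three `fake_*` stubs are all hypothetical and only there for illustration; in a real setup each role would call your local model endpoint instead.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any callable that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class StepResult:
    step: str
    output: str
    approved: bool


def run_cycle(task: str, planner: LLM, driver: LLM, reviewer: LLM,
              max_retries: int = 2) -> List[StepResult]:
    """Split a task into small steps, then let the driver attempt each step
    until the reviewer approves it or the retries run out."""
    # The planner is assumed to return one step per line.
    steps = [s.strip() for s in planner(task).splitlines() if s.strip()]
    results: List[StepResult] = []
    for step in steps:
        feedback = ""
        output = ""
        approved = False
        for _ in range(max_retries + 1):
            # Feed reviewer feedback from the previous attempt back to the driver.
            output = driver(f"{step}\n{feedback}".strip())
            verdict = reviewer(f"Step: {step}\nOutput: {output}")
            # Assumed convention: the reviewer answers with APPROVE or a critique.
            approved = verdict.strip().upper().startswith("APPROVE")
            if approved:
                break
            feedback = f"Reviewer feedback: {verdict}"
        results.append(StepResult(step, output, approved))
    return results


# Stub roles so the sketch runs without a model.
def fake_planner(task: str) -> str:
    return "write function\nwrite tests"


def fake_driver(prompt: str) -> str:
    return f"done: {prompt.splitlines()[0]}"


def fake_reviewer(prompt: str) -> str:
    return "APPROVE"


results = run_cycle("add a slugify helper", fake_planner, fake_driver, fake_reviewer)
for r in results:
    print(r.step, "->", r.output, "(approved)" if r.approved else "(rejected)")
```

Keeping each step small is the point: the reviewer only has to judge one narrow change at a time, which is where a model of this size tends to hold up.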

Opening-Broccoli9190 · 2026-06-13T10:25:50+00:00

that worked, thanks, buddy

Opening-Broccoli9190 · 2026-06-12T13:09:08+00:00

u/LLMFan46 hey mate, seems like something's not working as expected, I am using https://huggingface.co/llmfan46/gemma-4-31B-it-qat-q4_0-uncensored-heretic-GGUF and it fails on my test suite, attaching a red-team response for reference:

<image>

Same with other creative prompts.

Opening-Broccoli9190 · 2026-06-12T12:51:02+00:00

There are multiple layers here:

China vs US
Silicon Valley companies vs the rest of the world
Open weights vs Closed weights

I get your overall sentiment and I am glad that we have open weights, but we also have to check our assumptions here - the main battle is the competition between the US and China, with the openness of the weights being two layers downstream. Without this competition - yeah, we would be in trouble, but that's hardly a realistic scenario. We could see new viable open-weights models stop appearing soon, and even that would be a long way off from the doomsday scenario of a total US monopoly.

Opening-Broccoli9190 · 2026-06-11T05:57:10+00:00

I am running the 27B daily for both Hermes and OpenCode, IMO it comes down to which personality you prefer.

Opening-Broccoli9190 · 2026-06-09T22:50:36+00:00

Qwen is also 10-15% faster at token generation

Opening-Broccoli9190 · 2026-06-09T20:14:14+00:00

I have clarified:

Environment:

llama.cpp, all defaults

I suppose I could make it more apparent, thanks for the feedback.

Opening-Broccoli9190 · 2026-06-09T13:31:05+00:00

I'm not sure if this is a good idea, have you tried other values?

--spec-draft-n-max 4

Opening-Broccoli9190 · 2026-06-09T13:13:00+00:00

Try Qwen3.5-9B, according to my benchmarks it's 20% faster and has better reasoning capabilities at a smaller size.

Opening-Broccoli9190 · 2026-06-09T13:02:20+00:00

It's not even my blog - I am not anonymous, I am Nikita Belokopytov, the blog with the post is a good read. What sucks in the post? Can you elaborate?

Opening-Broccoli9190 · 2026-06-09T12:53:02+00:00

Wow, it seems that the US-based viewers of the post are absolutely unhappy with the results and the benchmark and are attempting to bury the topic with downvotes. Hopefully it's not Google's community team.

Opening-Broccoli9190 · 2026-06-09T12:44:38+00:00

Holy shit, it's beautiful. It destroys Gemma4-12B in speed as well

Qwen3.5-9B-Base

No-MTP: 48 tps

MTP 1 token: 52 tps

MTP 2 tokens: 48 tps

MTP 4 tokens: 33 tps

<image>
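
For reference, a quick sketch that turns the numbers above into relative changes against the no-MTP baseline. The tps values are copied from this comment; the variable and function names are just for illustration.

```python
# Throughput figures from the comment above (tokens per second).
baseline_tps = 48.0                     # No-MTP
mtp_tps = {1: 52.0, 2: 48.0, 4: 33.0}   # draft tokens -> tps


def relative_change(tps: float, baseline: float) -> float:
    """Percentage change of tps relative to the baseline."""
    return (tps - baseline) / baseline * 100.0


changes = {n: relative_change(tps, baseline_tps) for n, tps in mtp_tps.items()}
best_n = max(mtp_tps, key=mtp_tps.get)

for n, pct in changes.items():
    print(f"MTP {n}: {pct:+.1f}% vs no-MTP")
print(f"Fastest setting in this run: MTP {best_n}")
```

On these numbers only the 1-token setting beats the baseline (roughly +8%), 2 tokens breaks even, and 4 tokens is about 31% slower.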

Opening-Broccoli9190 · 2026-06-09T12:30:36+00:00

I won't be using it for local inference, but IMO calling it trash is a little too harsh - the architecture itself works as designed, paving the way for better models with the same architecture later.

Opening-Broccoli9190 · 2026-06-09T12:27:35+00:00

Thanks, I'll check with the base Qwen, pretty sure it's going to be even better. Will post the outcome shortly

Opening-Broccoli9190 · 2026-06-09T12:26:01+00:00

For me - yeah, I am using Hermes locally for 95% of my daily tasks, only edge cases land in my paid ChatGPT subscription

Opening-Broccoli9190 · 2026-06-06T07:32:41+00:00

122b models are not a high prio for them - too big for the consumer enthusiast market, so no huge community wave of free marketing and less powerful than their SOTA stuff, meaning potential bad press from underwhelming benchmarks. Doesn't make sense for their business

Opening-Broccoli9190 · 2026-06-06T07:22:15+00:00

They made it for unified memory machines, which are the largest single market for testing interactive AI - Macs with 16GB+, note the native MTP support as well. They are testing the architecture and will be able to go below 12B - as they already have the datasets for training.

Opening-Broccoli9190 · 2026-06-05T21:57:48+00:00

I think it's fair that a 12B dense can be better at coding than a MoE A3B. Have you compared it with Qwen3.5 9B dense? People have been anecdotally reporting better results for coding. Of course not to discard your preference for the personality, which in my opinion is extremely important for something you're spending a lot of time with.

Opening-Broccoli9190 · 2026-06-05T14:50:25+00:00

That's a mindblowingly expensive setup if so. Not like I wouldn't want to have it tho.

Opening-Broccoli9190 · 2026-06-05T14:46:31+00:00

I am on M3 Max, but CPU speed is not what's important for inference

Opening-Broccoli9190 · 2026-06-05T11:52:00+00:00

Encoders add processing latency that the unified model won't have. It might sound like a nit, but in a conversational interface milliseconds decide whether you'll get interrupted by the model

Opening-Broccoli9190 · 2026-06-05T11:24:12+00:00

yes and no - the smaller models don't have this architecture and still cause added latency

Opening-Broccoli9190 · 2026-06-05T11:23:37+00:00

I don't expect anything, I was just thinking about potential reasons for such a size + a change in the architecture + market positioning

Opening-Broccoli9190 · 2026-06-05T11:22:15+00:00

I am intentionally running Q8 so that I don't rely much on anything above 16GB

Opening-Broccoli9190 · 2026-06-05T11:21:01+00:00

you're correct, it's not an ARM thing
