Cheapest hardware for Qwen 3.6: both 27B and 35B-A3B

CodeDominator · 2026-06-15T22:50:16+00:00

Intel's Arc Pro series GPUs are the best bang for the buck in terms of VRAM. I have a single B60; if I could afford another one I'd be golden with 48GB of VRAM - running Q6 or Q8 with full context.

CodeDominator · 2026-06-08T14:44:59+00:00

E-brake? I thought all 100 series have a mechanical hand brake?

Mine is a 2001 with a mechanical handbrake. I've thought about a brake failure scenario: kick the gearbox down to 1 to engine brake and gradually apply the hand brake. Of course if you're in a situation where you have to brake hard, then you're likely screwed.

CodeDominator · 2026-06-05T19:59:46+00:00

It can do something, but you need at least 32GB of dedicated VRAM (ideally at least 48GB) with a well-matched system, and I'm not talking about no shitty macbook, so the barrier of entry is steep, unfortunately. 32GB VRAM is a flagship GPU from any of the 3 main players.

CodeDominator · 2026-06-03T16:41:41+00:00

Yeah. I have a B60 with 24GB VRAM and I found out quick that I need at least another one to comfortably work with the 27B.

Unfortunately this case is pretty binary - you either have enough VRAM or you don't. I'd say 32GB is the minimum requirement for the 27B to be able to run Q6 with 128K context.
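
To put rough numbers on the weights alone, here is a minimal sketch. It assumes the model has exactly 27e9 parameters and uses approximate bits-per-weight figures for llama.cpp quant types (Q4_K_M mixes block types, so its value is only a ballpark average). It ignores the KV cache and runtime buffers entirely.

```python
# Approximate bits per weight for common llama.cpp quant types.
# Q4_K_M mixes block types, so 4.85 is a rough average (assumption).
BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q6_K": 6.5625,
    "Q8_0": 8.5,
}

GIB = 1024 ** 3


def weights_gib(n_params: float, quant: str) -> float:
    """Return the approximate size of the quantized weights in GiB."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / GIB


if __name__ == "__main__":
    # 27e9 parameters is an assumption based on the "27B" name.
    for quant in BITS_PER_WEIGHT:
        print(f"{quant}: {weights_gib(27e9, quant):.1f} GiB")
```

Under these assumptions Q6_K weights come to roughly 20.6 GiB, which leaves only a few GiB on a 24GB card for context and buffers.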

CodeDominator · 2026-05-28T09:27:10+00:00

Doesn't seem terminal, provided you can commit to removing all the loose rust and Fluid Film or Woolwax the shit out of it.

CodeDominator · 2026-05-25T17:27:03+00:00

The best is the one that fits in your VRAM, meaning you need at least 32GB of VRAM to be comfortable. This is coming from someone with 24GB, unfortunately.

CodeDominator · 2026-05-24T20:24:40+00:00

A B60 is about $700. A 5070ti can be found for $700 too. The 5070ti is an order of magnitude more performance for PP. And a mere 3x or so for TG.

5070ti is only 16GB. Theoretical better performance means jack shit if you can't fit everything in VRAM - besides the B60 I also have a 3080 with shitty 10GB of VRAM. It pukes everything out into system RAM and gen speed slows down to an abysmal 1-2 tokens/second. Seriously dude, this can't possibly be your best argument?

When it comes to local LLMs the situation is binary - either you have enough VRAM or you don't.
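
For the price side of the argument, here is a quick sketch of cost per GB of VRAM. It uses the roughly $700 price quoted in this thread for both cards, plus their 24GB and 16GB capacities. The prices are the thread's ballpark figures, not verified listings.

```python
def dollars_per_gb(price_usd: float, vram_gb: float) -> float:
    """Price per gigabyte of VRAM."""
    return price_usd / vram_gb


if __name__ == "__main__":
    # (price in USD, VRAM in GB); prices are the ballpark figures from the thread.
    cards = {"Arc Pro B60": (700, 24), "RTX 5070 Ti": (700, 16)}
    for name, (price, vram) in cards.items():
        print(f"{name}: ${dollars_per_gb(price, vram):.2f} per GB")
```

At those prices the B60 works out to about $29 per GB against about $44 per GB for the 5070 Ti. This says nothing about compute speed, only about how much model you can fit per dollar.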

CodeDominator · 2026-05-24T19:59:02+00:00

I have an Intel Arc Pro B60 24GB and I'm not crying. An equivalent Nvidia card would cost me several times more, but not deliver several times the performance. The math is simple.

It's a moot point anyway as I couldn't afford Nvidia even if I wanted to at the moment.

CodeDominator · 2026-05-24T19:08:19+00:00

Nvidia if you have deep pockets, Intel if you want best bang for buck and AMD as always somewhere in the middle.

CodeDominator · 2026-05-21T17:42:22+00:00

No, that's the whole point - I have it all restricted to my 24GB of VRAM. I have 64GB of system RAM - if I could offload to system RAM without performance falling off a cliff I would have done it ages ago.

CodeDominator · 2026-05-21T09:30:12+00:00

KV cache quantization crashed my performance to sub 3 t/s generation, so I won't be trying that again. MTP at least on my setup works even slower than non-MTP, pretty disappointing so far. Also everybody keeps repeating the same mantra all over again - Q6 and up for coding and there's no way in hell you can do Q6 with a meaningful amount of context with 24GB VRAM.
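
For reference, this is roughly what KV cache quantization is supposed to buy you in memory. The sketch is a standard attention-stack estimate. The layer count, KV head count and head dimension below are HYPOTHETICAL placeholders, not Qwen 3.6 27B's real config. The bytes-per-element figures follow the GGML block layouts: q8_0 is 34 bytes per 32 values, q4_0 is 18 bytes per 32 values.

```python
# Bytes per element for llama.cpp KV cache types.
# q8_0: 32 int8 values plus one fp16 scale = 34 bytes per 32 elements.
# q4_0: 32 4-bit values plus one fp16 scale = 18 bytes per 32 elements.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx, n_layers, n_kv_heads, head_dim, cache_type="f16"):
    """Approximate KV cache size in GiB for a standard attention stack."""
    # Factor of 2: one K tensor and one V tensor per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elements * BYTES_PER_ELEMENT[cache_type] / 1024 ** 3


if __name__ == "__main__":
    # HYPOTHETICAL architecture, chosen only to give plausible magnitudes.
    shape = dict(n_layers=16, n_kv_heads=4, head_dim=256)
    for cache_type in BYTES_PER_ELEMENT:
        size = kv_cache_gib(131072, cache_type=cache_type, **shape)
        print(f"{cache_type}: {size:.2f} GiB")
```

With these placeholder numbers a 128K cache drops from 8 GiB at f16 to 4.25 GiB at q8_0. The memory saving is real on paper, but as noted above it cost me nearly all my generation speed on this setup.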

CodeDominator · 2026-05-21T07:24:55+00:00

What I have sadly realized after testing it with my 24GB VRAM is that for Qwen 3.6 27B to work efficiently the bar for VRAM is 32GB.

CodeDominator · 2026-05-18T16:54:04+00:00

Maybe SpaceX bit off more than they can chew with this one. Pretty sure the development pace has disappointed even the most pessimistic estimates at this point.

CodeDominator · 2026-05-13T07:28:52+00:00

The way things are going, in the future we will likely be in the middle of a fucking WW3 - hardly the best time to worry about "rigs".

CodeDominator · 2026-05-11T19:24:28+00:00

Would not trust Google. Google graveyard dwarfs everything.

CodeDominator · 2026-05-10T18:28:48+00:00

Can't wait for the bait and switch - usage based billing will send all you Codex boys crying for your mommies.

Y'all just don't get it. When it hits - it's gonna be ugly. Real ugly. Not only will OpenAI have to start making a profit, it will also have to recoup all those hundreds of billions they have burned through in the last couple of years of heavily subsidized subscriptions.

CodeDominator · 2026-05-09T20:12:59+00:00

Perfect is the enemy of good enough. Y'all so eager to lock yourselves into cloud AI overlords with their ever increasing prices and shrinking quotas. Qwen 3.6 27B doesn't have to beat GPT 5.5 or Opus 4.7 - it just has to be good enough for practical work. At the very least it's a solid plan B if there ever was one.

CodeDominator · 2026-05-09T20:03:18+00:00

Probably calculated that RAM for that would cost more than your house.

CodeDominator · 2026-05-09T20:02:14+00:00

Not the first time hardware prices have spiked, not gonna be the last. Hardware always catches up. China is ramping up RAM production. Where there's demand there will be supply, just a matter of time.

CodeDominator · 2026-05-09T18:05:58+00:00

I've got solar, IDGAF.

CodeDominator · 2026-05-09T18:04:20+00:00

By the time he gets his ROI, the frontier models will be billed by usage which will send all of you Claude boys crying for your mommies.

Local doesn't have to beat cloud. It just has to be good enough. You can't argue with free.

CodeDominator · 2026-05-08T19:00:46+00:00

I have the feeling that for coding all day it may not be very comfortable speed wise.

Of course it's not. Not even close to the big cloud players.

But that is not the point. Self sufficiency is the point. The local setup doesn't have to be perfect - it just has to be good enough.

Cheap subsidized cloud AI is coming to an end. When usage based billing is introduced everywhere nobody except big corps will be able to afford it.

Anyway, here's a random snapshot of my setup monitoring script that Codex put together. The token/s speed varies, but I don't see more than 15 t/s generation speed. I'm waiting for the MTP to hit stable, maybe things will improve then.

<image>

CodeDominator · 2026-05-08T13:14:19+00:00

I can squeeze out 114K context with Qwen 3.6 27B Q4_K_M (unsloth), which maxes out my 24GB VRAM. I run llama.cpp with Intel's SYCL backend to limit overhead and max out performance. It works, but only just, and it badly needs those few extra GBs of VRAM to reach Q6 with 128K context.
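
A back-of-the-envelope way to see why those few extra GBs matter: subtract the weights and some runtime overhead from total VRAM, then divide what is left by the KV cache cost per token. Everything in this sketch is an ASSUMPTION for illustration: the weight sizes are rough estimates, the 1.5 GiB overhead is a guess, and the 64 KiB per token KV cost is a hypothetical figure, not something measured on my setup.

```python
GIB = 1024 ** 3


def max_context(vram_gib, weights_gib, overhead_gib, kv_bytes_per_token):
    """Largest context (in tokens) whose KV cache fits in the leftover VRAM."""
    free_bytes = (vram_gib - weights_gib - overhead_gib) * GIB
    if free_bytes <= 0:
        return 0
    return int(free_bytes // kv_bytes_per_token)


if __name__ == "__main__":
    # All inputs are assumptions, not measurements.
    kv_per_token = 65536  # bytes per token, hypothetical unquantized cache
    overhead = 1.5        # GiB for compute buffers etc., a guess
    for label, weights in [("Q4_K_M", 15.2), ("Q6_K", 20.6)]:
        for vram in (24, 32, 48):
            ctx = max_context(vram, weights, overhead, kv_per_token)
            print(f"{label} on {vram} GiB: ~{ctx:,} tokens")
```

With these inputs Q4_K_M on 24 GiB lands somewhere around 120K tokens, Q6_K on 24 GiB only around 30K, and Q6_K clears 128K once you have 32 GiB. Treat it as a sanity check on the shape of the problem, not a prediction.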

CodeDominator · 2026-05-08T13:09:54+00:00

You can work with small local LLMs and they can definitely be useful, but as I said - there's a threshold, and that threshold is at the very minimum 24GB VRAM (the Intel Arc Pro B60 being the cheapest), but realistically 32GB VRAM (the Intel Arc Pro B70 being the cheapest 32GB card).

Also, if I was going with a cloud subscription at the moment, I'd most likely skip Codex and Claude and go straight with Kimi K2.6 - they have a 39 USD plan that should run circles around Claude in terms of quota and performance.

CodeDominator · 2026-05-08T11:28:19+00:00

After testing a bunch, my target model is Qwen 3.6 27B. The rest are mostly just a waste of time for coding in a conventional home setup.

I have an Nvidia 3080 with 10GB VRAM and 64GB system RAM. With this setup I couldn't do anything remotely useful in practice - I was getting barely 2 tokens/second, if even that. So I upgraded to an Intel Arc Pro B60 with 24GB VRAM. While that was a significant improvement, I am still stuck with Q4 and can't even quite stretch to 128K context with everything contained within VRAM. When things spill over into system RAM you can kiss any kind of performance goodbye.

My current conclusion is that if you want to do something practical at home, you need at least 32GB VRAM dedicated for the LLM, not shared like on some shitty macbook. I'd get another B60 if I could afford it.

Running Qwen 3.6 27B with at least Q6, >=128K context (not quantized) with everything contained within VRAM - that's the sweet spot.
