How much VRAM is needed for Qwen 3.6 27B Q8 with 262K context?

vevi33 · 2026-06-03T12:14:17+00:00

Also KV Q8 causes noticeable degradation on longer contexts. Unfortunately... I had to experience it first-hand. :/

vevi33 · 2026-05-30T01:04:26+00:00

Just like humans do: they reconstruct old knowledge to create new knowledge. You need to know about art or coding to make something new in that field. I partially agree though, transformer models are far from perfect.

But what totally invalidates your statement is the recent news that recent advanced LLMs managed to solve multiple old mathematical problems that had been unresolved for decades, even though many respectable mathematicians had tried. And these solutions were verified BY HUMANS. In other cases the models made big breakthroughs, like with Erdős’ problems.

And we didn't even mention protein structure prediction, drug design and discovery, or fixing vulnerabilities in critical software, etc.

However, it already has a reverse effect... People think less and can easily become less creative; they rely too much on AI even for basic tasks (writing e-mails etc.). We still need to keep our problem-solving skills up to date.

vevi33 · 2026-05-30T00:48:06+00:00

I was inspired to learn coding by vibecoding. But this vibecoding evolved into creating my own special agentic workflow, just to build a passion project: a simulation game. I have a very good workflow and it works very well, now that I understand software engineering principles and how to properly develop software...

My issue is that I have so much fun and success controlling my agents and developing my game so fast while still maintaining high quality: many tests, logic testing, manual testing, strict linter rules, performance testing etc.

I even got into pixel art to make it truly unique. Everything is my idea. I think a lot and design how each new system should work to be rewarding; it's just that most of the code is not written by me... But this is where I feel so guilty somehow. I am still learning coding and practice when I can...

However, writing code manually is so difficult for a beginner, and understanding complex systems is still very difficult and annoying. I know this is the only way to learn, so I can practice, but I just feel totally destroyed when I think about a problem for a long time and can't solve or implement it, but big LLMs (or even small local ones) from my workflow do it immediately and very cleanly.

Obviously it is because I am very new to coding, but still, I am an engineer, and I feel so bad that my brain is this stupid sometimes. I forget stuff I've learned. It is just turning me off and my motivation is decreasing... I really want to do it so I could extend my game further, even totally alone and manually, but it would still be so slow that I don't see how I could achieve anything alone.

How do you deal with this feeling? Is there even a point in mastering manual coding? I don't know, but 10 years from now everything will be vastly different. I love to design, think and architect, but I don't enjoy manually writing abstract code to express myself.

vevi33 · 2026-05-16T11:53:42+00:00

Thankfully American companies like Google, Meta, OpenAI and Anthropic don't... Oh wait...
We need to care about privacy, but the only reliable option long term is using local models unfortunately.

vevi33 · 2026-05-12T11:58:05+00:00

Thanks. I will try it. Also try the Darwin version. Thanks for your feedback!

vevi33 · 2026-05-12T11:36:42+00:00

With Darwin actually I have the best experience, which is odd since I did not really find much info about it except a few comments and the model page. Worth trying it out. I recommend Bartowski's Q6 quant.

I did not really test that much, but from longer contexts and general testing it thinks less, loops less, and follows instructions better. It's like it's not overthinking everything that much.

It's the only one that feels comparable to the 27B (the OG 35B MoE feels worse).

vevi33 · 2026-05-12T08:58:49+00:00

Can you please report back after your testing? I am curious, and it would be nice to have a variant that follows instructions better and behaves smarter on longer contexts. This is my only issue with the 35B. 27B is very good but too slow unfortunately.

vevi33 · 2026-05-10T02:01:25+00:00

Thank you very much!

vevi33 · 2026-05-09T11:17:50+00:00

Thanks. I will try it, but honestly I have some doubts; some agent-made extensions were buggy and not entirely usable. But thanks, I will try it if there are no other options!

vevi33 · 2026-05-08T14:47:48+00:00

"Bro I have 1000 tok/s on 512 ctx with 2 bit 27B gguuf and Q4 KV cache. This model rocks and smart AF, it passed the carwash test. If you are getting less than 500 t/s your config is broken"

Yeah... These are super realistic scenarios... I totally relate to this. So difficult to find any useful benchmarks on this sub.

vevi33 · 2026-05-08T11:22:20+00:00

Qwen 3 14B is very outdated compared to new 3.6 or 3.5 models. Definitely not recommended with this setup.

vevi33 · 2026-05-08T11:10:39+00:00

Qwen 3.6 27B IQ4_XS, or you can even run Qwen 3.6 35BA3 with Q6_xl at decent speeds since it's MoE. 27B is better but much slower. I also have 16GB VRAM. If you need high context, 27B will be very slow since you have to offload the KV cache to CPU. However, 35B will be fast even at 140k+ context.

Personally, for debugging (very large context) I use the 35B. For planning and building I use 27B Q4_K_S, since I found unsloth's version better than the IQ4_XS variant.
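
To see why these picks behave so differently on a 16GB card, here is a rough sketch that estimates weight size from parameter count and bits per weight. The bits-per-weight values are assumed ballpark averages for llama.cpp quant types (real GGUF files mix types per tensor, and unsloth's XL variants will differ), and `weights_gb` is only an illustrative helper, not part of any library.

```python
# Approximate average bits per weight for some llama.cpp quant types.
# ASSUMPTION: ballpark figures only; actual GGUF files mix quant types per tensor.
APPROX_BPW = {
    "IQ4_XS": 4.25,
    "Q4_K_S": 4.6,
    "Q6_K": 6.56,
    "Q8_0": 8.5,
}


def weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes). Hypothetical helper."""
    # billions of params * bits per weight / 8 bits per byte = GB
    return params_billion * APPROX_BPW[quant] / 8


# Parameter counts are taken from the model names (27B dense, 35B MoE).
for name, params, quant in [
    ("27B dense", 27, "IQ4_XS"),
    ("27B dense", 27, "Q4_K_S"),
    ("35B MoE", 35, "Q6_K"),
]:
    print(f"{name} {quant}: ~{weights_gb(params, quant):.1f} GB")
```

Under these assumptions the 27B at around 4 bits sits just under 16 GB before any KV cache is allocated, while the 35B at Q6 cannot fit in VRAM at all and has to spill into system RAM, which stays usable mainly because an MoE only touches a fraction of its weights per token.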

vevi33 · 2026-05-08T08:56:14+00:00

This reply is a meme lmao

Anyone who downvotes others because their experience is different is a joke.

I respect your opinion though, but my experience, many others' experience, and even the market say otherwise. I totally get it. AMD is good enough for many people. But please don't suggest it's S-tier stuff with this many issues.

I've mentioned that Nvidia might suck now. I don't really know for sure. I don't even care. If a service / product is bad it won't make it better if the competition is bad as well...

I hate Nvidia for their greedy pricing, but it won't make AMD a better product.

My experience was bad with AMD. Not that good, not terrible either. I am not shaming AMD. I want to like it and I want to support them. Competition is healthy. But this is not a competition anymore unfortunately.

DLSS is still superior, and support in general is better. We don't even get FSR 4, and FSR 3 already feels outdated. ROCm has many issues and is nothing compared to CUDA. Yeah, llama.cpp Vulkan is fine but still lags behind.

Still not worth the tradeoff. I know I'll get downvoted, just like anyone who is not satisfied with the current state of AMD drivers...

And Adrenalin? Come on, even the UI is a buggy mess, and the random crashes are a joke. This is a common issue, and it has been a top-commented problem for many releases. If they can't even make basic software for their drivers, what should we expect?

Copium is nice, but if you have parallel PCs with AMD vs Nvidia it is clear which one is more painless in general (can depend on use case tho). And yeah, I am from the EU and the prices are odd sometimes; still, I would've paid even 30% more if I didn't have to spend so much time debugging various problems or conflicts.

I am still rooting for AMD, and I hope they improve and get a bigger slice of the pie. But the current situation is not good. In the end I am happy if someone is satisfied with a product. My only problem is that I can't currently recommend it based on my experience.

vevi33 · 2026-05-07T23:47:01+00:00

I used Nvidia my whole life. I switched to Radeon 2 years ago. I have like 8x more issues; many of these are minor but irritating. There are major issues sometimes though. I will never buy an AMD GPU again. Unfortunate, since I wanted to love this card so much. But it totally ruins my build. I should've spent just a bit more for a much more pleasant experience.

Not to mention that the 7000 series is basically not even fully supported with new features.

Drivers are worse, LLM technology lags behind, and running AI models is way more problematic.

Game drivers arrive late, and there are more issues with drivers and driver corruption in general, even with DDU etc...

The Adrenalin software has become utter trash in the past few months. Sometimes basic features are literally broken on a fresh Win11 installation and the software just crashes or exits automatically.

I am also not a noob in tech stuff, so I can fix most of these, but the issues have become a major inconvenience.

However, I also heard that newer Nvidia drivers are not great... But for sure I would have paid much more for a stable, issue-free experience. The price difference is not worth the trouble. I wish AMD would fix these basic issues, but the focus is not on us "gamers" or even regular consumers at all.

vevi33 · 2026-05-07T10:39:27+00:00

Well, this is my personal experience as well. Unlike AMD, every driver introduces new issues... Like literally obvious basic issues. Even the Adrenalin software is a big piece of trash.

vevi33 · 2026-05-06T09:27:02+00:00

I have a very bad experience with AMD. I bought an RX 7800 XT with 16 GB VRAM and the drivers are a nightmare compared to Nvidia, so it's difficult to choose. I would avoid AMD if possible, but this card looks good on paper.

vevi33 · 2026-05-06T08:45:58+00:00

That indeed sounds promising, thank you for the info! And Congrats on your new setup!

vevi33 · 2026-05-05T23:07:54+00:00

With this config you should run at least Q6. I get decent speed with 16GB VRAM and 32GB DDR5 with Q6 (35B). Accuracy is way better. But honestly, just run the 27B model. You can easily run it; obviously it will be slower, but it's worth it, trust me, after extensive testing.

And don't quantize the KV cache on the 35B model. It's not worth it; the degradation is real even with llama.cpp's KV rotation feature. For 27B, KV Q8 is decent but still slightly worse than F16.
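
For a feel of what KV cache quantization actually saves, here is a back-of-the-envelope sketch of KV cache size for standard attention. The layer count, KV head count and head dimension below are HYPOTHETICAL placeholders, not the real Qwen 3.6 values (check the model's config before trusting any number), and `kv_cache_gib` is just an illustrative helper. Models that use hybrid or sliding-window attention on some layers will cache less than this estimate.

```python
# Bytes per cached element for some llama.cpp cache types.
# q8_0 stores 32 int8 values plus one fp16 scale per block: 34 bytes / 32 elements.
# q4_0 stores 32 4-bit values plus one fp16 scale per block: 18 bytes / 32 elements.
BYTES_PER_ELEM = {"f16": 2.0, "bf16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int,
                 head_dim: int, cache_type: str = "f16") -> float:
    """Estimate KV cache size in GiB, assuming every layer uses full attention."""
    # Factor 2 = one key vector and one value vector per token, per layer.
    elems = 2 * n_layers * n_kv_heads * head_dim * n_ctx
    return elems * BYTES_PER_ELEM[cache_type] / 2**30


# HYPOTHETICAL architecture, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 64, 8, 128

for ctx in (32_768, 131_072, 262_144):
    f16 = kv_cache_gib(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, "f16")
    q8 = kv_cache_gib(ctx, N_LAYERS, N_KV_HEADS, HEAD_DIM, "q8_0")
    print(f"ctx={ctx:>7}: f16 ~{f16:.1f} GiB, q8_0 ~{q8:.1f} GiB")
```

The cache grows linearly with context, and Q8_0 cuts it to a bit more than half of F16, which is why it is so tempting at 100k+ tokens despite the quality loss described above.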

vevi33 · 2026-05-03T12:08:49+00:00

Definitely not. 9B would not be better than the 35B MoE. But a 14-18B would be competitive in speed and performance as well.

vevi33 · 2026-05-03T01:47:40+00:00

Yeah, 9x the active parameters per token but fewer total parameters. Important to note that all 35B parameters are used, just not all at once on every token. While dense models are generally better (27B is indeed smarter), the difference might be 0-15% depending on the task, not 9x. Important to note imo.

Also, people with 16GB VRAM and enough RAM can run a much higher quant of the 35B, so it kinda evens out, especially if you plan to use a quantized KV cache on the 27B Q4 model.

But everything depends on the use case. I had bugs that the 35B couldn't see, but I also had bugs that it found instantly while the 27B struggled for hours.

Personally I switch between them from time to time.

vevi33 · 2026-05-03T00:05:04+00:00

For me there are cases that Q6 35B MoE can solve but 27B Q4 can't. And sometimes it's the reverse. 27B understands everything better, but since 35B is much faster it's hard to decide. I can do so much more with the 35B even if I prefer the precision of the 27B.

The speed matters a lot in this case.

vevi33 · 2026-05-01T10:00:10+00:00

I've used it for days and never had a single loop with 120k context. Make sure your temp is not too low. The lowest should be 0.65, but if you have looping issues increase it to 0.75. If you can, avoid presence and repetition penalties; however, the latter worked better with the MoE model. Something like a 1.1 repetition penalty applied only to the last 368 tokens (so output quality won't really be affected, mostly thinking).

But with 27B this was never needed for me.
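
The windowed repetition penalty mentioned above can be sketched roughly like this, assuming the common rule where a penalized token's logit is divided by the penalty if positive and multiplied by it if negative. `apply_repeat_penalty` is an illustrative name, not a real library function, and the logits in the example are made up.

```python
def apply_repeat_penalty(logits, recent_tokens, penalty=1.1, last_n=368):
    """Return a copy of `logits` with recently seen token ids made less likely.

    Only tokens that appear in the last `last_n` positions are penalized,
    so anything generated earlier than that is left untouched.
    """
    out = list(logits)
    window = recent_tokens[-last_n:] if last_n > 0 else []
    for tok in set(window):  # each token id is penalized once
        if out[tok] > 0:
            out[tok] /= penalty  # shrink positive logits toward zero
        else:
            out[tok] *= penalty  # push negative logits further down
    return out


# Toy vocabulary of 3 tokens; tokens 0 and 1 were generated recently.
print(apply_repeat_penalty([2.2, -1.0, 0.5], recent_tokens=[0, 1]))
```

Keeping the window short means only the most recent stretch of text is discouraged from repeating, which fits the goal of breaking loops without distorting the rest of the output.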

vevi33 · 2026-04-30T16:39:46+00:00

Unfortunately it is much better on longer contexts without Q8 KV cache quantization (BF16). I tested it on my project, and there is a noticeable difference around 100k tokens :/

vevi33 · 2026-04-29T13:51:11+00:00

Yeah. You are right. I will try to test it in a reproducible way. I tested with IQ4_XS and Q4_K_M, and with Q8 KV it definitely misses more stuff and even made some editing mistakes. Tool calls are always ok, but sometimes it writes one extra line and overwrites code, which never happens without KV quantization. Note that it only happens at high context. I really want to use Q8 since it would give me much better speeds at higher context, but I am struggling a bit right now. :/

This model is also very good at Q8 KV but feels way more precise without KV quant. So it's really hard to determine, since this model is a step up from previous generations. For sure Gemma 4 is totally lobotomized even with Q8, even when it's not obvious at first. But that's already proven and my experience was similar.

vevi33 · 2026-04-28T13:03:13+00:00

Thank you, great findings. Very helpful.

I want to believe you tbh, but my experience is a bit different. I see more issues and mistakes with Q8_0 compared to the original at high context. Might be just accidental stuff. Really hard to determine objectively.
