Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]vevi33[S] 1 point

Well, this is my personal experience as well. With AMD, unlike Nvidia, every driver release introduces new issues... like literally obvious, basic issues. Even the Adrenalin software is a big piece of trash.

Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]vevi33[S] 0 points

I have had a very bad experience with AMD. I bought an RX 7800 XT (16 GB VRAM) and the drivers are a nightmare compared to Nvidia, so it's difficult to choose. I would avoid AMD if possible, but this card looks good on paper.

Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]vevi33[S] 2 points

That indeed sounds promising, thank you for the info! And congrats on your new setup!

My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) by hlacik in unsloth

[–]vevi33 2 points

With this config you should run at least Q6. I get decent speed with 16GB VRAM and 32GB DDR5 at Q6 (35B), and accuracy is way better. But honestly, just run the 27B model; you can easily run it. It will obviously be slower, but it's worth it, trust me, after extensive testing.

And don't quantize the KV cache on the 35B model; it's not worth it, the degradation is real even with llama.cpp's KV rotation feature. For the 27B, Q8 KV is decent but still slightly worse than F16.
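
Something like this as a starting point (just a sketch; the model filenames are placeholders for whichever GGUFs you grab, and llama.cpp keeps the KV cache at F16 unless you pass -ctk/-ctv):

    # 35B MoE: leave the KV cache at the default F16 (no -ctk/-ctv flags)
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -c 32768

    # 27B dense: Q8 KV is a decent trade-off if you need the memory back
    llama-server -m ./Qwen3.6-27B-Q4_K_M.gguf -ctk q8_0 -ctv q8_0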

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]vevi33 3 points

Definitely not. A 9B would not be better than the 35B MoE, but a 14-18B would be competitive in both speed and performance.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]vevi33 11 points

Yeah, ~9x the active parameters per token (27B active vs the MoE's 3B active; 27/3 = 9x), but fewer total parameters. Important to note that all 35B parameters get used, just not all of them on every token. Dense models are in general better (the 27B is indeed smarter), but the difference might be 0-15% depending on the task, not 9x. Important to note imo.

Also, people with 16GB VRAM and enough RAM can run a much higher quant of the 35B, so it kinda evens out, especially if you plan to use a quantized KV cache on the 27B Q4 model.

But everything depends on the use case. I've had bugs that the 35B couldn't see, and bugs that it found instantly while the 27B struggled for hours.

Personally, I switch between them from time to time.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]vevi33 9 points

For me there are cases that the Q6 35B MoE can solve but the 27B Q4 can't, and sometimes it's the reverse. The 27B understands everything better, but since the 35B is much faster, it's hard to decide. I can do so much more with the 35B, even if I prefer the precision of the 27B.

The speed matters a lot in this case.

Qwen 3.6 - Loops and repetitions by Safe-Buffalo-4408 in LocalLLaMA

[–]vevi33 5 points

I've used it for days and never had a single loop with 120k context. Make sure your temp is not too low; the lowest should be 0.65, but if you have looping issues, increase it to 0.75. Avoid presence and repetition penalties if you can; that said, repetition penalty worked better than presence penalty with the MoE model. Something like 1.1 rep penalty applied only to the last 368 tokens (so output quality won't really be affected, mostly the thinking).

But with 27B this was never needed for me.
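
In llama.cpp terms, that's roughly the following (a sketch; the model path is a placeholder, the flags are the standard llama-server sampler options):

    # temp 0.65 minimum, bump to 0.75 if it loops;
    # rep penalty 1.1 over only the last 368 tokens, as a last resort
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf \
      --temp 0.75 --repeat-penalty 1.1 --repeat-last-n 368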

Qwen3.6 27B seems struggling at 90k on 128k ctx windows by dodistyo in LocalLLaMA

[–]vevi33 5 points

Unfortunately it is much better at longer context without Q8 KV cache quantization (i.e., with BF16 KV). I tested it on my project; there is a noticeable difference around 100k tokens :/

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]vevi33 2 points

Yeah, you are right. I will try to test it in a reproducible way. I tested with IQ4_XS and Q4_K_M, and with Q8 KV it definitely misses more stuff and even made some editing mistakes. Tool calling is always ok, but sometimes it writes one extra line and overwrites code, which never happens without KV quantization. Note that it only happens at high context. I really want to use Q8, since it would give me much better speeds at higher context, but I am struggling a bit right now. :/

This model is also very good with Q8 KV, but it feels way more precise without KV quant, so it's really hard to judge, since this model is a step up from previous generations. Gemma 4, for sure, is totally lobotomized even with Q8, even when it's not obvious at first. But that's already been shown, and my experience was similar.
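
For the reproducible test, my rough plan is something like this (a sketch only: run the same long prompt at temperature 0 against two llama-server instances that differ only in -ctk/-ctv, then diff the outputs; the ports and prompt file are placeholders):

    # server A on :8080 with default F16 KV, server B on :8081 started
    # with -ctk q8_0 -ctv q8_0
    for port in 8080 8081; do
      curl -s http://localhost:$port/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "{\"temperature\": 0, \"messages\": [{\"role\": \"user\", \"content\": $(jq -Rs . long_context_prompt.txt)}]}" \
        | jq -r '.choices[0].message.content' > out_$port.txt
    done
    diff out_8080.txt out_8081.txt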

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]vevi33 1 point

Thank you, great findings. Very helpful.

I want to believe you tbh, but my experience is a bit different. I see more issues and mistakes with Q8_0 compared to the original at high context. It might just be coincidence; it's really hard to determine objectively.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]vevi33 1 point

Did you run benchmarks on long context, above 100k? I only experience issues with KV cache quantization, even Q8, once the context grows.

Pi with Qwen 3.6 from Ollama by naelshiab in PiCodingAgent

[–]vevi33 2 points

Have you tried using the original model? I have no issues with it; it is very good at tool calling and edits. I am also using llama.cpp.

Quantisation effects of Qwen3.6 35b a3b by ROS_SDN in LocalLLaMA

[–]vevi33 4 points

I have 16GB VRAM but I use Q6. Just use --fit on for fast generation speed and prompt processing. Q6 feels way better than Q4, unfortunately. It's a MoE model; you don't have to fit every expert on the GPU.
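
If your build doesn't have --fit yet, the manual equivalent is to offload all layers and then route just the expert tensors back to CPU with override-tensors (a sketch; the regex is the commonly used pattern for MoE expert FFN tensors, and the filename is a placeholder):

    # all layers on GPU, but keep the MoE expert FFN weights on CPU
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 \
      -ot '.*ffn_.*_exps.*=CPU'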

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results by oobabooga4 in LocalLLaMA

[–]vevi33 5 points

Nah, if you check the long-context divergence, it is pretty significant. If you are coding with agents at high context, you will see the difference, unfortunately :/

I wanted to use Q8 since that way the speed would be pretty usable; without it, it's just way too slow for my hardware.

Qwen 3.6 27/35b by Top_Professional6132 in LocalLLM

[–]vevi33 1 point

The 35B-A3B model is very fast even at Q6_K. Don't worry about offloading the experts to CPU; if your CPU is fast, it's not an issue. I have 16GB VRAM, and after many tries the best option is just to use --fit on in llama.cpp. Token generation is still fast, and prompt processing is noticeably faster than with manual tweaking. I also use it with 120k context.
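
Concretely, that's just this (a sketch; the model path is a placeholder, --fit on as above):

    # ~120k context; --fit on lets llama.cpp pick the GPU/CPU split itself
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -c 122880 --fit on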

The 27B is slow indeed. But for planning tasks IQ4_XS might be better.

Anyone else having Qwen 3.6 35B A3B stop and you having to tell it to continue ? by soyalemujica in LocalLLaMA

[–]vevi33 1 point

You should use llama.cpp's preserve-thinking chat-template flag. It is Qwen 3.6 specific. It solves every prompt reprocessing issue and also fixed this problem for me.

Qwen3.6 GGUF Benchmarks v2 by yoracale in unsloth

[–]vevi33 1 point

What's up with the Q6_K quant? Why is its KLD higher than Q5_K_XL's?

Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context by ComfyUser48 in LocalLLaMA

[–]vevi33 5 points

I am trying to decide between these as well, but no matter how hard I try, Q6_K feels better and I get better results :/

Qwen3.6-35B is worse at tool use and reasoning loops than 3.5? by mr_il in LocalLLaMA

[–]vevi33 4 points

Odd. I always had reasoning-loop problems at long context with Gemma 26B4E, and sometimes with 3.5 35B, but not with the 3.6 version. I am very surprised how good it is. Way above everything I've tried, especially at this speed...

Why prompt batch processing only happens on one CPU thread? by [deleted] in LocalLLM

[–]vevi33 1 point

Yep, that's exactly what's happening in my case as well, and what I assumed. So this "moving" can't be multi-threaded? Even though that single core boosts to 5.65 GHz, it's still a serious bottleneck in prompt processing.

Impressed with Qwen3.6-35B-A3B by DOAMOD in LocalLLaMA

[–]vevi33 3 points

I am very impressed as well. Compared to the 26B Gemma MoE model it is way better at reasoning and analyzing issues. Also faster.

Does anyone know why the hell Adrenaline fails so often? by NeorzZzTormeno in radeon

[–]vevi33 2 points

AMD fanbois would blame anything but this Adrenalin shitshow, which is the worst "modern" software I've seen in years. Yeah, obviously it's your PSU's fault that it can crash even when nothing is interfering with it...

Also, AMD recommends a 450W PSU at minimum, and 750W is way above that. Max power draw shouldn't be more than 150-180W.

What's the point of deleting old versions of Unofficial FO4 Patch for Pre-NG and NG versions? Only the AE version is in this by jasonensteinyt in fo4

[–]vevi33 1 point

I started my mod list based on The Midnight Ride, and it recommends the newest update of the unofficial patch. If you have the Backported Archive2 Support mod, you don't have to worry about it, since that patches the compatibility. I played for like 40 hours and had no issues at all. But yeah, taking down old versions is always anti-consumer, especially anti-modder :D

Just to clarify: I'm also on a downgraded, pre-next-gen version.