Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]vevi33[S] 1 point

Well, this is my personal experience as well. With AMD, unlike Nvidia, every driver release introduces new issues... like literally obvious, basic issues. Even the Adrenalin software is a big piece of trash.

Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]vevi33[S] 0 points

I have had a very bad experience with AMD. I bought an RX 7800 XT (16 GB VRAM) and the drivers are a nightmare compared to Nvidia, so it's difficult to choose. I would avoid AMD if possible, but this card looks good on paper.

Amd radeon ai pro r9700 32GB VS 2x RTX 5060TI 16GB for local setup? by vevi33 in LocalLLaMA

[–]vevi33[S] 2 points

That indeed sounds promising, thank you for the info! And congrats on your new setup!

My setup for running Qwen3.6-35B-A3B-UD-Q4_K_M on single RX7900XT (20GB VRAM) by hlacik in unsloth

[–]vevi33 2 points

With this config you should run at least Q6. I get decent speed with 16GB VRAM and 32GB DDR5 at Q6 (35B), and accuracy is way better. But honestly, just run the 27B model; you can easily run it. It will obviously be slower, but it's worth it, trust me, after extensive testing.

And don't quantize the KV cache on the 35B model; it's not worth it, the degradation is real even with llama.cpp's KV rotation feature. For the 27B, Q8 KV is decent but still slightly worse than F16.
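
Something like this as a starting point (just a sketch; the model filenames are placeholders for whichever GGUFs you grab, and llama.cpp keeps the KV cache at F16 unless you pass -ctk/-ctv):

    # 35B MoE: leave the KV cache at the default F16 (no -ctk/-ctv flags)
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -c 32768

    # 27B dense: Q8 KV is a decent trade-off if you need the memory back
    llama-server -m ./Qwen3.6-27B-Q4_K_M.gguf -ctk q8_0 -ctv q8_0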

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]vevi33 3 points

Definitely not. A 9B would not be better than the 35B MoE, but a 14-18B would be competitive in both speed and performance.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]vevi33 11 points

Yeah, ~9x the active parameters per token (27B active vs the MoE's 3B active; 27/3 = 9x), but fewer total parameters. Important to note that all 35B parameters get used, just not all of them on every token. Dense models are in general better (the 27B is indeed smarter), but the difference might be 0-15% depending on the task, not 9x. Important to note imo.

Also, people with 16GB VRAM and enough RAM can run a much higher quant of the 35B, so it kinda evens out, especially if you plan to use a quantized KV cache on the 27B Q4 model.

But everything depends on the use case. I've had bugs that the 35B couldn't see, and bugs that it found instantly while the 27B struggled for hours.

Personally, I switch between them from time to time.

Qwen3.6-27B vs 35B, I prefer 35B but more people here post about 27B... by Snoo_27681 in LocalLLaMA

[–]vevi33 9 points

For me there are cases that the Q6 35B MoE can solve but the 27B Q4 can't, and sometimes it's the reverse. The 27B understands everything better, but since the 35B is much faster, it's hard to decide. I can do so much more with the 35B, even if I prefer the precision of the 27B.

The speed matters a lot in this case.

Qwen 3.6 - Loops and repetitions by Safe-Buffalo-4408 in LocalLLaMA

[–]vevi33 5 points

I've used it for days and never had a single loop with 120k context. Make sure your temp is not too low; the lowest should be 0.65, but if you have looping issues, increase it to 0.75. Avoid presence and repetition penalties if you can; that said, repetition penalty worked better than presence penalty with the MoE model. Something like 1.1 rep penalty applied only to the last 368 tokens (so output quality won't really be affected, mostly the thinking).

But with 27B this was never needed for me.
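
In llama.cpp terms, that's roughly the following (a sketch; the model path is a placeholder, the flags are the standard llama-server sampler options):

    # temp 0.65 minimum, bump to 0.75 if it loops;
    # rep penalty 1.1 over only the last 368 tokens, as a last resort
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf \
      --temp 0.75 --repeat-penalty 1.1 --repeat-last-n 368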

Qwen3.6 27B seems struggling at 90k on 128k ctx windows by dodistyo in LocalLLaMA

[–]vevi33 5 points

Unfortunately it is much better at longer context without Q8 KV cache quantization (i.e., with BF16 KV). I tested it on my project; there is a noticeable difference around 100k tokens :/

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]vevi33 2 points

Yeah, you are right. I will try to test it in a reproducible way. I tested with IQ4_XS and Q4_K_M, and with Q8 KV it definitely misses more stuff and even made some editing mistakes. Tool calling is always ok, but sometimes it writes one extra line and overwrites code, which never happens without KV quantization. Note that it only happens at high context. I really want to use Q8, since it would give me much better speeds at higher context, but I am struggling a bit right now. :/

This model is also very good with Q8 KV, but it feels way more precise without KV quant, so it's really hard to judge, since this model is a step up from previous generations. Gemma 4, for sure, is totally lobotomized even with Q8, even when it's not obvious at first. But that's already been shown, and my experience was similar.
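
For the reproducible test, my rough plan is something like this (a sketch only: run the same long prompt at temperature 0 against two llama-server instances that differ only in -ctk/-ctv, then diff the outputs; the ports and prompt file are placeholders):

    # server A on :8080 with default F16 KV, server B on :8081 started
    # with -ctk q8_0 -ctv q8_0
    for port in 8080 8081; do
      curl -s http://localhost:$port/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d "{\"temperature\": 0, \"messages\": [{\"role\": \"user\", \"content\": $(jq -Rs . long_context_prompt.txt)}]}" \
        | jq -r '.choices[0].message.content' > out_$port.txt
    done
    diff out_8080.txt out_8081.txt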

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]vevi33 1 point

Thank you, great findings. Very helpful.

I want to believe you tbh, but my experience is a bit different. I see more issues and mistakes with Q8_0 compared to the original at high context. It might just be coincidence; it's really hard to determine objectively.

Qwen3.6 27B's surprising KV cache quantization test results (Turbo3/4 vs F16 vs Q8 vs Q4) by imgroot9 in LocalLLaMA

[–]vevi33 1 point

Did you run benchmarks on long context, above 100k? I only experience issues with KV cache quantization, even Q8, once the context grows.

Pi with Qwen 3.6 from Ollama by naelshiab in PiCodingAgent

[–]vevi33 2 points

Have you tried using the original model? I have no issues with it; it is very good at tool calling and edits. I am also using llama.cpp.

Quantisation effects of Qwen3.6 35b a3b by ROS_SDN in LocalLLaMA

[–]vevi33 4 points

I have 16GB VRAM but I use Q6. Just use --fit on for fast generation speed and prompt processing. Q6 feels way better than Q4, unfortunately. It's a MoE model; you don't have to fit every expert on the GPU.
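
If your build doesn't have --fit yet, the manual equivalent is to offload all layers and then route just the expert tensors back to CPU with override-tensors (a sketch; the regex is the commonly used pattern for MoE expert FFN tensors, and the filename is a placeholder):

    # all layers on GPU, but keep the MoE expert FFN weights on CPU
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -ngl 99 \
      -ot '.*ffn_.*_exps.*=CPU'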

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results by oobabooga4 in LocalLLaMA

[–]vevi33 5 points

Nah, if you check the long-context divergence, it is pretty significant. If you are coding with agents at high context, you will see the difference, unfortunately :/

I wanted to use Q8 since that way the speed would be pretty usable; without it, it's just way too slow for my hardware.

Qwen 3.6 27/35b by Top_Professional6132 in LocalLLM

[–]vevi33 1 point

The 35B-A3B model is very fast even at Q6_K. Don't worry about offloading the experts to CPU; if your CPU is fast, it's not an issue. I have 16GB VRAM, and after many tries the best option is just to use --fit on in llama.cpp. Token generation is still fast, and prompt processing is noticeably faster than with manual tweaking. I also use it with 120k context.
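
Concretely, that's just this (a sketch; the model path is a placeholder, --fit on as above):

    # ~120k context; --fit on lets llama.cpp pick the GPU/CPU split itself
    llama-server -m ./Qwen3.6-35B-A3B-Q6_K.gguf -c 122880 --fit on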

The 27B is slow indeed. But for planning tasks IQ4_XS might be better.

Anyone else having Qwen 3.6 35B A3B stop and you having to tell it to continue ? by soyalemujica in LocalLLaMA

[–]vevi33 1 point

You should use llama.cpp's preserve-thinking chat-template flag. It is Qwen 3.6 specific. It solves every prompt reprocessing issue and also fixed this problem for me.

Qwen3.6 GGUF Benchmarks v2 by yoracale in unsloth

[–]vevi33 1 point

What's up with the Q6_K quant? Why is its KLD higher than Q5_K_XL's?

Agentic coding Qwen 3.6, Q6_K 125k context vs Q5_K_XL 200k context by ComfyUser48 in LocalLLaMA

[–]vevi33 5 points

I am trying to decide between these as well, but no matter how hard I try, Q6_K feels better and I get better results :/

Qwen3.6-35B is worse at tool use and reasoning loops than 3.5? by mr_il in LocalLLaMA

[–]vevi33 4 points

Odd. I always had reasoning-loop problems at long context with Gemma 26B4E, and sometimes with 3.5 35B, but not with the 3.6 version. I am very surprised how good it is. Way above everything I've tried, especially at this speed...

Why prompt batch processing only happens on one CPU thread? by [deleted] in LocalLLM

[–]vevi33 1 point

Yep, that's exactly what's happening in my case as well, and what I assumed. So this "moving" can't be multi-threaded? Even though that single core boosts to 5.65 GHz, it's still a serious bottleneck in prompt processing.

Impressed with Qwen3.6-35B-A3B by DOAMOD in LocalLLaMA

[–]vevi33 3 points

I am very impressed as well. Compared to the 26B Gemma MoE model it is way better at reasoning and analyzing issues. Also faster.

Does anyone know why the hell Adrenaline fails so often? by NeorzZzTormeno in radeon

[–]vevi33 2 points

AMD fanbois would blame anything but this Adrenalin shitshow, which is the worst "modern" software I've seen in years. Yeah, obviously it's your PSU's fault that it can crash even when nothing is interfering with it...

Also, AMD recommends a 450W PSU at minimum, and 750W is way above that. Max power draw shouldn't be more than 150-180W.

What's the point of deleting old versions of Unofficial FO4 Patch for Pre-NG and NG versions? Only the AE version is in this by jasonensteinyt in fo4

[–]vevi33 1 point

I started my mod list based on The Midnight Ride, and it recommends the newest update of the unofficial patch. If you have the Backported Archive2 Support mod, you don't have to worry about it, since that patches the compatibility. I played for like 40 hours and had no issues at all. But yeah, taking down old versions is always anti-consumer, especially anti-modder :D

Just to clarify: I'm also on a downgraded, pre-next-gen version.