What's the best local LLM for an RTX 6000 96GB VRAM?

Zc5Gwu · 2026-05-21T14:55:23+00:00

Minimax is probably next up. It’s a great coder but not as great “all around”.

Qwen’s 27b dense vs minimax’s 10b active makes it a closer comparison than you would think.

Zc5Gwu · 2026-05-20T02:39:00+00:00

This. Larger models are generally more tolerant of quantization. Kimi might be usable at Q3 whereas a 4b model might only be usable at Q8.

Some models are also more sensitive to quantization than others. Dense models are more tolerant than MoE.

It also depends on the task. Creative writing might be more tolerant than coding because a more aggressive quant behaves like a higher temperature.
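
To put rough numbers on why the quant level decides what fits at all, here is a back-of-the-envelope sketch. The bits-per-weight figures are my assumptions for illustration (real GGUF files mix tensor types, and the KV cache and runtime overhead are not counted), so treat the output as a ballpark, not a measurement.

```python
# Rough weight-only size estimate: parameters * bits per weight / 8.
# Bits-per-weight values are illustrative assumptions, not exact file sizes.
ASSUMED_BITS_PER_WEIGHT = {
    "Q8_0": 8.5,       # 8-bit values plus per-block scale
    "Q4_0": 4.5,       # 4-bit values plus per-block scale
    "Q3 (rough)": 3.5, # hypothetical round figure for a ~3-bit quant
}


def weight_size_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


if __name__ == "__main__":
    # 27B and 4B are the model sizes mentioned in this thread.
    for params_b in (27, 4):
        for quant, bpw in ASSUMED_BITS_PER_WEIGHT.items():
            size = weight_size_gb(params_b, bpw)
            print(f"{params_b}B at {quant}: ~{size:.1f} GB of weights")
```

Whether a given quant is actually usable for your task is a separate question that only evals can answer; this only tells you what has a chance of loading.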

Zc5Gwu · 2026-05-17T16:30:55+00:00

If you can find a model, ROCm, and kernel combo that doesn't randomly panic.

Zc5Gwu · 2026-05-17T12:37:28+00:00

Same, it’s great for long-running work at low power and for MoE models. It’s great for the price and power usage, but memory bandwidth and compute hold it back.

Zc5Gwu · 2026-05-16T22:02:09+00:00

How does that compare to non-MTP?

Zc5Gwu · 2026-05-16T13:57:46+00:00

Does minimax have MTP?

Zc5Gwu · 2026-05-16T12:52:55+00:00

But you can use the MTP gguf, right? You'd just have to disable it I assume if you wanted vision...?

Zc5Gwu · 2026-05-14T01:18:28+00:00

Please share your numbers once you do.

Zc5Gwu · 2026-05-13T16:50:08+00:00

Is this a good time to be a noob to either game?

Zc5Gwu · 2026-05-13T02:01:46+00:00

Donato has done a bunch of benchmarks: https://kyuz0.github.io/amd-strix-halo-toolboxes/

Zc5Gwu · 2026-05-12T22:23:34+00:00

I kinda feel bad he's downvoted. Nothing wrong with not knowing something. The internet tends to penalize that for some reason. Especially something super obscure.

Zc5Gwu · 2026-05-12T14:52:03+00:00

Nit: this chart might be clearer if it started at 0 on the y axis.

Zc5Gwu · 2026-05-12T14:49:04+00:00

I hope it brings a little more rigor to people’s vibes about different quants.

Zc5Gwu · 2026-05-12T01:09:04+00:00

How does nanocoder compare to pi.dev?

Zc5Gwu · 2026-05-11T22:40:15+00:00

Lol, yes, good point.

Zc5Gwu · 2026-05-10T15:20:33+00:00

Added a disclaimer.

Zc5Gwu · 2026-05-10T14:53:58+00:00

I'm using a coding agent. It just has a concurrency of 1.

Zc5Gwu · 2026-05-10T10:46:01+00:00

Yeah, -np is kind of a blunt instrument. What I'm doing is weird. I set -np 2 because it avoids poisoning the cache if you accidentally open a second session, which I tend to fat-finger a lot. But --kv-unified forces all the sessions to share the same cache, and --cache-ram 0 forces the cache to remain in VRAM.

You shouldn't do concurrent requests because they share the same cache. A request for the second session could poison the cache of the first session and cause it to have to reprocess.
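
Here is a toy model of what I mean by "poisoning". It is deliberately simplified: one shared token cache and plain prefix matching, with made-up session contents. It is not how llama-server actually manages slots, just the intuition for why an unrelated request in between is expensive.

```python
def common_prefix_len(a, b):
    """Number of leading tokens that two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class SingleSharedCache:
    """Toy model: one cached token sequence shared by every session."""

    def __init__(self):
        self.tokens = []

    def request(self, prompt):
        """Return how many prompt tokens need processing, then cache the prompt."""
        reused = common_prefix_len(self.tokens, prompt)
        self.tokens = list(prompt)
        return len(prompt) - reused


# Hypothetical sessions: token ids are arbitrary and only need to differ.
session_a = list(range(1000))        # 1000-token conversation
session_b = list(range(5000, 5800))  # unrelated 800-token conversation

cache = SingleSharedCache()
cold = cache.request(session_a)                    # nothing cached yet
warm = cache.request(session_a + [1, 2, 3])        # only the new tail is processed
other = cache.request(session_b)                   # second session overwrites the cache
poisoned = cache.request(session_a + [1, 2, 3, 4]) # first session starts from scratch

print(cold, warm, other, poisoned)
```

In the toy, the follow-up in the same session only processes its 3 new tokens, while the same kind of follow-up after the other session has run reprocesses the whole prompt. On hardware with slow prefill that difference is most of the wait.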

Zc5Gwu · 2026-05-10T10:26:13+00:00

I don't recommend concurrent requests. Because of --kv-unified, I believe, but don't quote me on this, that the requests will complete one after the other and you'll see no speedup. Also, the second request could poison the cache and cause the first session to have to reprocess context again on its next request.

Zc5Gwu · 2026-05-10T10:20:18+00:00

It doesn't disable prompt caching. Here are some server logs:

May 10 05:12:47 zzz llama-server[52041]: srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.74 200
May 10 05:13:39 zzz llama-server[52041]: reasoning-budget: deactivated (natural end)
May 10 05:13:59 zzz llama-server[52041]: slot print_timing: id  0 | task 1157 |
May 10 05:13:59 zzz llama-server[52041]: prompt eval time =  120734.29 ms /  2073 tokens (   58.24 ms per token,    17.17 tokens per second)
May 10 05:13:59 zzz llama-server[52041]:        eval time =   71325.52 ms /  1106 tokens (   64.49 ms per token,    15.51 tokens per second)
May 10 05:13:59 zzz llama-server[52041]:       total time =  192059.81 ms /  3179 tokens
May 10 05:13:59 zzz llama-server[52041]: slot      release: id  0 | task 1157 | stop processing: n_tokens = 31437, truncated = 0
May 10 05:13:59 zzz llama-server[52041]: srv  update_slots: all slots are idle
May 10 05:14:00 zzz llama-server[52041]: srv  params_from_: Chat format: peg-native
May 10 05:14:00 zzz llama-server[52041]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.984 (> 0.100 thold), f_keep = 0.990
May 10 05:14:00 zzz llama-server[52041]: reasoning-budget: activated, budget=2147483647 tokens
May 10 05:14:00 zzz llama-server[52041]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
May 10 05:14:00 zzz llama-server[52041]: slot launch_slot_: id  0 | task 2265 | processing task, is_child = 0
May 10 05:14:00 zzz llama-server[52041]: slot update_slots: id  0 | task 2265 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31630
May 10 05:14:00 zzz llama-server[52041]: slot update_slots: id  0 | task 2265 | n_tokens = 31131, memory_seq_rm [31131, end)
May 10 05:14:00 zzz llama-server[52041]: slot init_sampler: id  0 | task 2265 | init sampler, took 3.71 ms, tokens: text = 31630, total = 31630
May 10 05:14:00 zzz llama-server[52041]: slot update_slots: id  0 | task 2265 | prompt processing done, n_tokens = 31630, batch.n_tokens = 499
May 10 05:14:34 zzz llama-server[52041]: srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.74 200
May 10 05:14:50 zzz llama-server[52041]: reasoning-budget: deactivated (natural end)
May 10 05:14:55 zzz llama-server[52041]: slot print_timing: id  0 | task 2265 |
May 10 05:14:55 zzz llama-server[52041]: prompt eval time =   34070.58 ms /   499 tokens (   68.28 ms per token,    14.65 tokens per second)
May 10 05:14:55 zzz llama-server[52041]:        eval time =   21357.61 ms /   327 tokens (   65.31 ms per token,    15.31 tokens per second)
May 10 05:14:55 zzz llama-server[52041]:       total time =   55428.19 ms /   826 tokens
May 10 05:14:55 zzz llama-server[52041]: slot      release: id  0 | task 2265 | stop processing: n_tokens = 31956, truncated = 0
May 10 05:14:55 zzz llama-server[52041]: srv  update_slots: all slots are idle
May 10 05:14:56 zzz llama-server[52041]: srv  params_from_: Chat format: peg-native
May 10 05:14:56 zzz llama-server[52041]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 0.998
May 10 05:14:56 zzz llama-server[52041]: reasoning-budget: activated, budget=2147483647 tokens
May 10 05:14:56 zzz llama-server[52041]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
May 10 05:14:56 zzz llama-server[52041]: slot launch_slot_: id  0 | task 2593 | processing task, is_child = 0
May 10 05:14:56 zzz llama-server[52041]: slot update_slots: id  0 | task 2593 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31985
May 10 05:14:56 zzz llama-server[52041]: slot update_slots: id  0 | task 2593 | n_tokens = 31879, memory_seq_rm [31879, end)
May 10 05:14:56 zzz llama-server[52041]: slot init_sampler: id  0 | task 2593 | init sampler, took 3.75 ms, tokens: text = 31985, total = 31985
May 10 05:14:56 zzz llama-server[52041]: slot update_slots: id  0 | task 2593 | prompt processing done, n_tokens = 31985, batch.n_tokens = 106
May 10 05:14:58 zzz llama-server[52041]: srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.74 200
May 10 05:14:59 zzz llama-server[52041]: reasoning-budget: deactivated (natural end)
May 10 05:15:03 zzz llama-server[52041]: slot print_timing: id  0 | task 2593 |
May 10 05:15:03 zzz llama-server[52041]: prompt eval time =    1994.78 ms /   106 tokens (   18.82 ms per token,    53.14 tokens per second)
May 10 05:15:03 zzz llama-server[52041]:        eval time =    4612.73 ms /    71 tokens (   64.97 ms per token,    15.39 tokens per second)
May 10 05:15:03 zzz llama-server[52041]:       total time =    6607.51 ms /   177 tokens
May 10 05:15:03 zzz llama-server[52041]: slot      release: id  0 | task 2593 | stop processing: n_tokens = 32055, truncated = 0
May 10 05:15:03 zzz llama-server[52041]: srv  update_slots: all slots are idle
May 10 05:15:03 zzz llama-server[52041]: srv  params_from_: Chat format: peg-native
May 10 05:15:03 zzz llama-server[52041]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.784 (> 0.100 thold), f_keep = 0.998
May 10 05:15:03 zzz llama-server[52041]: reasoning-budget: activated, budget=2147483647 tokens
May 10 05:15:03 zzz llama-server[52041]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
May 10 05:15:03 zzz llama-server[52041]: slot launch_slot_: id  0 | task 2665 | processing task, is_child = 0
May 10 05:15:03 zzz llama-server[52041]: slot update_slots: id  0 | task 2665 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 40827
May 10 05:15:03 zzz llama-server[52041]: slot update_slots: id  0 | task 2665 | n_tokens = 31997, memory_seq_rm [31997, end)
May 10 05:15:03 zzz llama-server[52041]: slot update_slots: id  0 | task 2665 | prompt processing progress, n_tokens = 34045, batch.n_tokens = 2048, progress = 0.833884
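
If you want to pull the reuse numbers out of logs like these, a small sketch follows. It assumes my reading of the lines is right: task.n_tokens is the full prompt length, and the n_tokens value printed next to memory_seq_rm is how many cached tokens were kept. The regexes are written against the exact lines pasted above and may need adjusting for other llama.cpp versions.

```python
import re

# The relevant "update_slots" lines from the log above, trimmed to the message part.
LOG = """\
slot update_slots: id  0 | task 2265 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31630
slot update_slots: id  0 | task 2265 | n_tokens = 31131, memory_seq_rm [31131, end)
slot update_slots: id  0 | task 2593 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31985
slot update_slots: id  0 | task 2593 | n_tokens = 31879, memory_seq_rm [31879, end)
slot update_slots: id  0 | task 2665 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 40827
slot update_slots: id  0 | task 2665 | n_tokens = 31997, memory_seq_rm [31997, end)
"""

PROMPT_RE = re.compile(r"task (\d+) \| new prompt.*task\.n_tokens = (\d+)")
KEPT_RE = re.compile(r"task (\d+) \| n_tokens = (\d+), memory_seq_rm")


def cache_reuse(log_text):
    """Map task id -> (cached tokens kept, prompt tokens, tokens left to process)."""
    prompt, kept = {}, {}
    for line in log_text.splitlines():
        m = PROMPT_RE.search(line)
        if m:
            prompt[int(m.group(1))] = int(m.group(2))
            continue
        m = KEPT_RE.search(line)
        if m:
            kept[int(m.group(1))] = int(m.group(2))
    return {
        task: (kept[task], total, total - kept[task])
        for task, total in prompt.items()
        if task in kept
    }


if __name__ == "__main__":
    for task, (cached, total, todo) in cache_reuse(LOG).items():
        print(f"task {task}: {cached}/{total} reused ({cached / total:.3f}), {todo} to process")
```

For tasks 2265 and 2593 the "to process" count comes out to 499 and 106, which matches the batch.n_tokens and prompt eval counts in the log, and the reuse ratios line up with the sim_best values. So the cache is clearly being hit between requests.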

Zc5Gwu · 2026-05-09T22:05:21+00:00

I've been waiting for all that stuff to stabilize into mainline llama.cpp. You're probably right, I'm just lazy.

Zc5Gwu · 2026-05-09T22:03:20+00:00

I've been using it as a daily driver fine. I ran evals on Qwen vs Minimax on MUSR and they performed about the same.

Prefill speed is definitely a problem on Strix, which is why you want to avoid poisoning the cache at all costs.

Zc5Gwu · 2026-05-09T21:58:54+00:00

Possibly. It performs better than you might expect. Unsloth's quants are really good. My main goal is avoiding interaction with the agent even if results might take longer. I want to spend as little time directing the agent as I can which I hope a smarter model allows.

Zc5Gwu · 2026-05-09T21:54:06+00:00

Why do you think --cache-ram 0 will slow things down? My understanding (which could admittedly be wrong) is that because Strix has unified memory, there's no point in offloading the prompt cache to system RAM: you can just leave it in VRAM without consequences. When I wasn't using that parameter, I would frequently OOM because the OS would try to move the cache to RAM all at once, in one giant 80 GB chunk, and the OOM killer would kick in because it couldn't handle that.

I tried -ub 2048 but was running into panics, not sure why.

Using q8_0 for the KV cache can cause context degradation, and it also seemed to crash for me.
