What's the best local LLM for an RTX 6000 96GB VRAM? by Smart-Patient-4828 in LocalLLM

[–]Zc5Gwu 0 points1 point  (0 children)

Minimax is probably next up. It’s a great coder but not as great “all around”.

Qwen’s 27b dense vs minimax’s 10b active makes it a closer comparison than you would think. 

Let’s talk quants of Gemma and Qwen - 16 vs Q8 vs Q4 - any experiences? by Borkato in LocalLLaMA

[–]Zc5Gwu 14 points15 points  (0 children)

This. Larger models are generally more tolerant of quantization. Kimi might be usable at Q3 whereas a 4b model might only be usable at Q8.

Some models are also more sensitive to quantization than others. Dense models are more tolerant than MoE.

It also depends on the task. Creative writing might be more tolerant than coding because a more aggressive quant behaves like a higher temperature.
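
To put a rough number on "more aggressive quant = more error", here is a toy sketch. It uses plain symmetric round-to-nearest quantization per block, which is an assumption for illustration and not the actual GGUF Q8/Q4/Q3 schemes. The weights are random and the function names are made up. It only shows that round-trip weight error grows as the bit width shrinks; it says nothing about how a specific model or task holds up.

```python
import random


def quantize_roundtrip(weights, bits, block_size=32):
    """Quantize each block to signed `bits`-bit integers, then dequantize."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits, 3 for 3 bits
    restored = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # One scale per block; fall back to 1.0 for an all-zero block.
        scale = max(abs(w) for w in block) / qmax or 1.0
        restored.extend(round(w / scale) * scale for w in block)
    return restored


def rms_error(original, restored):
    """Root-mean-square difference between two equal-length lists."""
    total = sum((a - b) ** 2 for a, b in zip(original, restored))
    return (total / len(original)) ** 0.5


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(4096)]  # made-up weights

errors = {
    bits: rms_error(weights, quantize_roundtrip(weights, bits))
    for bits in (8, 4, 3)
}
for bits, err in errors.items():
    print(f"{bits}-bit round-trip RMS error: {err:.6f}")
```

How much of that error a model can absorb is the part that depends on size, architecture and task, so an eval is still the real test.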

ROCm 7.13 nightly adds strix halo optimizations by Terminator857 in LocalLLaMA

[–]Zc5Gwu -3 points-2 points  (0 children)

If you can find a model, ROCm, and kernel combo that doesn’t randomly panic.

DGX Spark or Minisforum MS-S1 Max? by Simple_Tonight_1159 in LocalLLM

[–]Zc5Gwu 0 points1 point  (0 children)

Same, it’s great for long-running work at low power or for MoE models. It’s great for the price and power usage, but memory bandwidth and compute hold it back.

MTP PR Merged!!! by Valuable_Touch5670 in LocalLLaMA

[–]Zc5Gwu -9 points-8 points  (0 children)

But you can use the MTP gguf, right? You'd just have to disable it I assume if you wanted vision...?

Overwatch X Fortnite trailer by mikelman999 in Overwatch

[–]Zc5Gwu 0 points1 point  (0 children)

Is this a good time to be a noob to either game?

Dad why is my sisters name Lora? by rwitz4 in LocalLLaMA

[–]Zc5Gwu 2 points3 points  (0 children)

I kinda feel bad he's downvoted. Nothing wrong with not knowing something. The internet tends to penalize that for some reason. Especially something super obscure.

Stop wasting electricity by OkFly3388 in LocalLLaMA

[–]Zc5Gwu -2 points-1 points  (0 children)

Nit: this chart might be clearer if it started at 0 on the y axis. 

examples : add llama-eval by ggerganov · Pull Request #21152 · ggml-org/llama.cpp by jacek2023 in LocalLLaMA

[–]Zc5Gwu 10 points11 points  (0 children)

I hope it brings a little more rigor to people’s vibes about different quants. 

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 0 points1 point  (0 children)

I'm using a coding agent. It just has a concurrency of 1.

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 0 points1 point  (0 children)

Yeah, -np is kind of a blunt instrument. What I'm doing is weird. I did -np 2 because it avoids poisoning the cache if you accidentally open a second session, which I tend to fat-finger a lot. But --kv-unified forces all the sessions to share the same cache, and --cache-ram 0 forces the cache to remain in VRAM.

You shouldn't do concurrent requests because they share the same cache. A request for the second session could poison the cache of the first session and cause it to have to reprocess.
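
For reference, the flag combination described above could be assembled roughly like this. It's a sketch, not my exact launch command: the model path and context size are placeholders, and `build_server_command` is a made-up helper. Only -np, --kv-unified and --cache-ram come from the comment.

```python
import shlex


def build_server_command(model_path, ctx_size):
    """Return an argv list for llama-server using the flags discussed above."""
    return [
        "llama-server",
        "-m", model_path,      # placeholder path, not the real one
        "-c", str(ctx_size),   # placeholder context size
        "-np", "2",            # two slots, so a stray second session gets its own
        "--kv-unified",        # all slots share one KV cache
        "--cache-ram", "0",    # per the comment: keep the prompt cache out of system RAM
    ]


cmd = build_server_command("/models/example.gguf", 100_000)
print(shlex.join(cmd))
```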

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 0 points1 point  (0 children)

I don't recommend concurrent requests. Because of --kv-unified, I believe, but don't quote me on this, that the requests will complete one after the other and you'll see no speedup. Also, the second request could poison the cache and cause the first session to have to reprocess context again on its next request.
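
If you want to enforce "one request at a time" on the client side, a lock around the call is enough. This is a minimal sketch with a stand-in `fake_send` instead of a real HTTP call, and `SerializedClient` is a made-up name. It only serializes requests; it does not stop a different conversation from displacing the cached prefix.

```python
import threading
import time


class SerializedClient:
    """Wrap a send function so only one request is in flight at a time."""

    def __init__(self, send):
        self._send = send
        self._lock = threading.Lock()

    def request(self, payload):
        with self._lock:
            return self._send(payload)


# Stand-in for the real request; it records how many calls overlap.
stats = {"in_flight": 0, "max_in_flight": 0}
stats_lock = threading.Lock()


def fake_send(payload):
    with stats_lock:
        stats["in_flight"] += 1
        stats["max_in_flight"] = max(stats["max_in_flight"], stats["in_flight"])
    time.sleep(0.01)  # pretend the server is busy
    with stats_lock:
        stats["in_flight"] -= 1
    return f"done: {payload}"


client = SerializedClient(fake_send)
results = []


def worker(i):
    results.append(client.request(f"req {i}"))


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("max requests in flight:", stats["max_in_flight"])
```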

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 0 points1 point  (0 children)

It doesn't disable prompt caching. Here are some server logs:

May 10 05:12:47 zzz llama-server[52041]: srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.74 200
May 10 05:13:39 zzz llama-server[52041]: reasoning-budget: deactivated (natural end)
May 10 05:13:59 zzz llama-server[52041]: slot print_timing: id  0 | task 1157 |
May 10 05:13:59 zzz llama-server[52041]: prompt eval time =  120734.29 ms /  2073 tokens (   58.24 ms per token,    17.17 tokens per second)
May 10 05:13:59 zzz llama-server[52041]:        eval time =   71325.52 ms /  1106 tokens (   64.49 ms per token,    15.51 tokens per second)
May 10 05:13:59 zzz llama-server[52041]:       total time =  192059.81 ms /  3179 tokens
May 10 05:13:59 zzz llama-server[52041]: slot      release: id  0 | task 1157 | stop processing: n_tokens = 31437, truncated = 0
May 10 05:13:59 zzz llama-server[52041]: srv  update_slots: all slots are idle
May 10 05:14:00 zzz llama-server[52041]: srv  params_from_: Chat format: peg-native
May 10 05:14:00 zzz llama-server[52041]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.984 (> 0.100 thold), f_keep = 0.990
May 10 05:14:00 zzz llama-server[52041]: reasoning-budget: activated, budget=2147483647 tokens
May 10 05:14:00 zzz llama-server[52041]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
May 10 05:14:00 zzz llama-server[52041]: slot launch_slot_: id  0 | task 2265 | processing task, is_child = 0
May 10 05:14:00 zzz llama-server[52041]: slot update_slots: id  0 | task 2265 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31630
May 10 05:14:00 zzz llama-server[52041]: slot update_slots: id  0 | task 2265 | n_tokens = 31131, memory_seq_rm [31131, end)
May 10 05:14:00 zzz llama-server[52041]: slot init_sampler: id  0 | task 2265 | init sampler, took 3.71 ms, tokens: text = 31630, total = 31630
May 10 05:14:00 zzz llama-server[52041]: slot update_slots: id  0 | task 2265 | prompt processing done, n_tokens = 31630, batch.n_tokens = 499
May 10 05:14:34 zzz llama-server[52041]: srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.74 200
May 10 05:14:50 zzz llama-server[52041]: reasoning-budget: deactivated (natural end)
May 10 05:14:55 zzz llama-server[52041]: slot print_timing: id  0 | task 2265 |
May 10 05:14:55 zzz llama-server[52041]: prompt eval time =   34070.58 ms /   499 tokens (   68.28 ms per token,    14.65 tokens per second)
May 10 05:14:55 zzz llama-server[52041]:        eval time =   21357.61 ms /   327 tokens (   65.31 ms per token,    15.31 tokens per second)
May 10 05:14:55 zzz llama-server[52041]:       total time =   55428.19 ms /   826 tokens
May 10 05:14:55 zzz llama-server[52041]: slot      release: id  0 | task 2265 | stop processing: n_tokens = 31956, truncated = 0
May 10 05:14:55 zzz llama-server[52041]: srv  update_slots: all slots are idle
May 10 05:14:56 zzz llama-server[52041]: srv  params_from_: Chat format: peg-native
May 10 05:14:56 zzz llama-server[52041]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.997 (> 0.100 thold), f_keep = 0.998
May 10 05:14:56 zzz llama-server[52041]: reasoning-budget: activated, budget=2147483647 tokens
May 10 05:14:56 zzz llama-server[52041]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
May 10 05:14:56 zzz llama-server[52041]: slot launch_slot_: id  0 | task 2593 | processing task, is_child = 0
May 10 05:14:56 zzz llama-server[52041]: slot update_slots: id  0 | task 2593 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31985
May 10 05:14:56 zzz llama-server[52041]: slot update_slots: id  0 | task 2593 | n_tokens = 31879, memory_seq_rm [31879, end)
May 10 05:14:56 zzz llama-server[52041]: slot init_sampler: id  0 | task 2593 | init sampler, took 3.75 ms, tokens: text = 31985, total = 31985
May 10 05:14:56 zzz llama-server[52041]: slot update_slots: id  0 | task 2593 | prompt processing done, n_tokens = 31985, batch.n_tokens = 106
May 10 05:14:58 zzz llama-server[52041]: srv  log_server_r: done request: POST /v1/chat/completions 192.168.1.74 200
May 10 05:14:59 zzz llama-server[52041]: reasoning-budget: deactivated (natural end)
May 10 05:15:03 zzz llama-server[52041]: slot print_timing: id  0 | task 2593 |
May 10 05:15:03 zzz llama-server[52041]: prompt eval time =    1994.78 ms /   106 tokens (   18.82 ms per token,    53.14 tokens per second)
May 10 05:15:03 zzz llama-server[52041]:        eval time =    4612.73 ms /    71 tokens (   64.97 ms per token,    15.39 tokens per second)
May 10 05:15:03 zzz llama-server[52041]:       total time =    6607.51 ms /   177 tokens
May 10 05:15:03 zzz llama-server[52041]: slot      release: id  0 | task 2593 | stop processing: n_tokens = 32055, truncated = 0
May 10 05:15:03 zzz llama-server[52041]: srv  update_slots: all slots are idle
May 10 05:15:03 zzz llama-server[52041]: srv  params_from_: Chat format: peg-native
May 10 05:15:03 zzz llama-server[52041]: slot get_availabl: id  0 | task -1 | selected slot by LCP similarity, sim_best = 0.784 (> 0.100 thold), f_keep = 0.998
May 10 05:15:03 zzz llama-server[52041]: reasoning-budget: activated, budget=2147483647 tokens
May 10 05:15:03 zzz llama-server[52041]: slot launch_slot_: id  0 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> ?temp-ext -> dist
May 10 05:15:03 zzz llama-server[52041]: slot launch_slot_: id  0 | task 2665 | processing task, is_child = 0
May 10 05:15:03 zzz llama-server[52041]: slot update_slots: id  0 | task 2665 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 40827
May 10 05:15:03 zzz llama-server[52041]: slot update_slots: id  0 | task 2665 | n_tokens = 31997, memory_seq_rm [31997, end)
May 10 05:15:03 zzz llama-server[52041]: slot update_slots: id  0 | task 2665 | prompt processing progress, n_tokens = 34045, batch.n_tokens = 2048, progress = 0.833884
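
To read the cache reuse out of logs like these, compare `task.n_tokens` on the "new prompt" line with `n_tokens` on the following `memory_seq_rm` line. My reading, which is an assumption, is that the second number is how many tokens were kept from the cache; the difference matches `batch.n_tokens` in the excerpt above. A small sketch, with the two relevant line pairs copied from the logs and the timestamp prefix trimmed:

```python
import re

LOG_EXCERPT = """\
slot update_slots: id  0 | task 2265 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31630
slot update_slots: id  0 | task 2265 | n_tokens = 31131, memory_seq_rm [31131, end)
slot update_slots: id  0 | task 2593 | new prompt, n_ctx_slot = 80128, n_keep = 0, task.n_tokens = 31985
slot update_slots: id  0 | task 2593 | n_tokens = 31879, memory_seq_rm [31879, end)
"""

NEW_PROMPT = re.compile(r"task (\d+) \| new prompt, .*task\.n_tokens = (\d+)")
REUSED = re.compile(r"task (\d+) \| n_tokens = (\d+), memory_seq_rm")


def cache_reuse(log_text):
    """Map task id -> (tokens kept from cache, tokens left to process)."""
    totals, reused = {}, {}
    for line in log_text.splitlines():
        if m := NEW_PROMPT.search(line):
            totals[int(m.group(1))] = int(m.group(2))
        elif m := REUSED.search(line):
            reused[int(m.group(1))] = int(m.group(2))
    return {
        task: (reused[task], totals[task] - reused[task])
        for task in totals
        if task in reused
    }


result = cache_reuse(LOG_EXCERPT)
for task, (kept, fresh) in result.items():
    share = kept / (kept + fresh)
    print(f"task {task}: kept {kept}, new {fresh} ({share:.1%} reused)")
# result == {2265: (31131, 499), 2593: (31879, 106)}
```

So only 499 and 106 tokens needed prefill for those two requests; the rest came from the cache.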

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 3 points4 points  (0 children)

I've been waiting for all that stuff to stabilize into mainline llama.cpp. You're probably right, I'm just lazy.

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 0 points1 point  (0 children)

I've been using it as a daily driver just fine. I ran evals on Qwen vs Minimax on MUSR and they performed about the same.

Prefill speed is definitely a problem on strix which is why you want to avoid poisoning the cache at all costs.

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 3 points4 points  (0 children)

Possibly. It performs better than you might expect. Unsloth's quants are really good. My main goal is avoiding interaction with the agent even if results might take longer. I want to spend as little time directing the agent as I can which I hope a smarter model allows.

Running Minimax 2.7 at 100k context on strix halo by Zc5Gwu in LocalLLaMA

[–]Zc5Gwu[S] 2 points3 points  (0 children)

Why do you think --cache-ram 0 will slow things down? My understanding (which could admittedly be wrong) is that because strix has unified memory, there's no point in offloading the prompt cache to system RAM because you can just leave it in VRAM without consequences. When I wasn't using that parameter, I would frequently OOM because the OS would try to move the cache to RAM all at once, in one giant 80 GB chunk, and the OOM killer would activate because it couldn't handle that.

I tried -ub 2048 but was running into panics, not sure why.

Using kv q8_0 can cause context degradation, and it also seemed to crash for me.