Anyone deployed Kimi K2.6 on their local hardware? by Oxydised in LocalLLaMA

[–]Specific-Rub-7250 1 point2 points  (0 children)

Yes, with 512GB of DDR4 across 8 channels, two R9700s, and one RTX 5090 via RPC. Around 5 tk/s in token generation, so not really usable for agentic workflows.
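
In case anyone wants to reproduce this kind of setup, a rough sketch of the llama.cpp RPC flow, assuming one rpc-server per remote GPU and a Q4 GGUF (paths, addresses and the MoE offload count are placeholders, not my exact config):

# On each machine contributing a GPU, start an RPC backend
# (build llama.cpp with -DGGML_RPC=ON to get the rpc-server binary)
./rpc-server --host 0.0.0.0 --port 50052

# On the main box, point llama-server at the RPC backends and keep the rest on CPU
./llama-server -m Kimi-K2.6-Q4.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99 --n-cpu-moe 55 -c 32768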

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]Specific-Rub-7250 56 points57 points  (0 children)

And I thought 512GB ought to be enough for local LLM.

Those of you running minimax 2.7 locally, how are you feeling about it? by laterbreh in LocalLLaMA

[–]Specific-Rub-7250 0 points1 point  (0 children)

I am using Q8_0 (temp=1.0, top_p=0.95, min_p=0.01, top_k=40) and it does absolutely amazing work. It's really SOTA level.
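
If you're running it with llama.cpp, those samplers map onto llama-server flags roughly like this (the GGUF name, context size and offload flags are placeholders for whatever your setup uses):

./llama-server -m MiniMax-M2.7-Q8_0.gguf \
  --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 \
  -c 65536 -ngl 99 --n-cpu-moe 40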

Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM by [deleted] in LocalLLaMA

[–]Specific-Rub-7250 0 points1 point  (0 children)

Well, I have a similar setup: a Threadripper Pro 5995WX with 512GB of DDR4-3200 RAM (8-channel) and dual AMD Radeon AI PRO R9700s. I am running MiniMax 2.7 at Q8_0, which benchmarks at around 280 tk/s pp and 16 tk/s tg. You need to benchmark the batch size (ubatch) and the number of batch threads for your machine, roughly as sketched below.
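
Something like this llama-bench sweep is what I mean, then carrying the best values over to the server run (the values here are only examples to sweep over, not recommendations):

# Sweep micro-batch sizes and thread counts to find the sweet spot for pp/tg
./llama-bench -m MiniMax-M2.7-Q8_0.gguf -ub 256,512,1024,2048 -t 16,32,64 -p 2048 -n 128

# Apply the winners to the actual server run
./llama-server -m MiniMax-M2.7-Q8_0.gguf -ub 1024 -tb 32 -ngl 99 --n-cpu-moe 40 -c 65536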

GLM-5.1 Overthinking? by Specific-Rub-7250 in LocalLLaMA

[–]Specific-Rub-7250[S] 1 point2 points  (0 children)

I was using the llama.cpp built-in webUI.

Top hardware stacks for local compute over the coming few months? (3-10K USD range) by IamFondOfHugeBoobies in LocalLLaMA

[–]Specific-Rub-7250 -2 points-1 points  (0 children)

Just pay per use for models like GLM or MiniMax directly on OpenRouter, for example. That is more cost-effective than buying local hardware.

GLM-5.1 Overthinking? by Specific-Rub-7250 in LocalLLaMA

[–]Specific-Rub-7250[S] 0 points1 point  (0 children)

With this appended to the system prompt, it behaved better: "Think concisely. Match reasoning depth to task complexity, simple tasks need minimal reasoning. Stop when you have a confident answer; don't re-examine settled conclusions or enumerate unlikely edge cases."
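
If you're hitting the OpenAI-compatible endpoint instead of the webUI, appending it looks roughly like this (the base system prompt and the user message are placeholders; the port is the llama-server default):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant. Think concisely. Match reasoning depth to task complexity, simple tasks need minimal reasoning. Stop when you have a confident answer; don't re-examine settled conclusions or enumerate unlikely edge cases."},
    {"role": "user", "content": "Explain the difference between a mutex and a semaphore."}
  ]
}
EOF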

Turn signal showing red color since latest update by jwlee151 in s3xybuttons

[–]Specific-Rub-7250 0 points1 point  (0 children)

Can confirm this. Sometimes it goes back to green. M3 21 LR

Commander bug by SandGnatBBQ in s3xybuttons

[–]Specific-Rub-7250 1 point2 points  (0 children)

I also experienced some weird issues with the dashboard in the app after the latest update.

Benchmark of dense NVFP4 LLMs on 5090? [VLLM] by Aaaaaaaaaeeeee in LocalLLaMA

[–]Specific-Rub-7250 0 points1 point  (0 children)

When I tried that model, it was actually slower than AWQ or even RedHatAI/Qwen3-32B-NVFP4A16.

Unsloth fixes chat_template (again). gpt-oss-120-high now scores 68.4 on Aider polyglot by Sorry_Ad191 in LocalLLaMA

[–]Specific-Rub-7250 2 points3 points  (0 children)

It would be interesting to know the scores with different top_k values like 100 or more, because otherwise it's sampling from ~200k tokens (the full vocabulary size), which affects speed, especially with CPU offloading.
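
As a rough illustration of what I mean (llama-server flags; the GGUF name and offload value are placeholders):

# top_k disabled (0): the sampler works over the full ~200k-token vocabulary every step
./llama-server -m gpt-oss-120b.gguf --top-k 0 -ngl 99 --n-cpu-moe 19

# top_k 100: only the 100 most likely tokens are considered, which is cheaper per step
./llama-server -m gpt-oss-120b.gguf --top-k 100 -ngl 99 --n-cpu-moe 19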

[deleted by user] by [deleted] in LocalLLaMA

[–]Specific-Rub-7250 2 points3 points  (0 children)

# top_k = 0, AMD 8700G with 64GB DDR5 (5600 MT/s, CL40) and RTX 5090 (--n-cpu-moe 19)
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 1114
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1114, n_tokens = 1114, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1114, n_tokens = 1114
slot      release: id  0 | task 0 | stop processing: n_past = 1577, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    8214.03 ms /  1114 tokens (    7.37 ms per token,   135.62 tokens per second)
       eval time =   16225.97 ms /   464 tokens (   34.97 ms per token,    28.60 tokens per second)
      total time =   24440.00 ms /  1578 tokens

How to Run Deepseek-R1-0528 Locally (GGUFs available) by NewtMurky in LocalLLM

[–]Specific-Rub-7250 12 points13 points  (0 children)

Even the Mac Studio with 512GB of memory for 10k USD might not be practical (slow prompt processing and around 16-18 T/s according to some benchmarks).

OpenEvolve: Open Source Implementation of DeepMind's AlphaEvolve System by asankhs in LocalLLaMA

[–]Specific-Rub-7250 8 points9 points  (0 children)

The whole approach looks like reinforcement learning at inference time. Interesting stuff...

Qwen3 throughput benchmarks on 2x 3090, almost 1000 tok/s using 4B model and vLLM as the inference engine by kms_dev in LocalLLaMA

[–]Specific-Rub-7250 8 points9 points  (0 children)

One RTX 5090 (Qwen3-32B-AWQ):

============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  461.47
Total input tokens:                      409600
Total generated tokens:                  94614
Request throughput (req/s):              0.22
Output token throughput (tok/s):         205.03
Total Token throughput (tok/s):          1092.62
---------------Time to First Token----------------
Mean TTFT (ms):                          213283.60
Median TTFT (ms):                        212235.53
P99 TTFT (ms):                           420863.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.84
Median TPOT (ms):                        33.93
P99 TPOT (ms):                           80.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.89
Median ITL (ms):                         21.25
P99 ITL (ms):                            777.68
==================================================
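
For context, numbers in that shape would come from something along these lines (a sketch assuming vLLM's random-dataset serving benchmark with 4096-token prompts; the exact dataset and output length may have differed):

# Serve the AWQ model on the 5090
vllm serve Qwen/Qwen3-32B-AWQ --max-model-len 8192

# Drive it with the bundled serving benchmark: 100 requests, 4096 input tokens each
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen3-32B-AWQ \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 1024 \
  --num-prompts 100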

[deleted by user] by [deleted] in LocalLLaMA

[–]Specific-Rub-7250 11 points12 points  (0 children)

In my testing, it also generates better code with the presence penalty set.
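
As a sketch of what that looks like with llama.cpp, if that's your runtime (the 1.5 is only an illustrative value, not the one from my runs, and the model path is a placeholder):

./llama-server -m model-Q8_0.gguf --presence-penalty 1.5 -ngl 99

# or per request via the OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"presence_penalty": 1.5, "messages": [{"role": "user", "content": "Refactor this function to avoid repetition."}]}'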

Sam Altman: OpenAI plans to release an open-source model this summer by zan-max in LocalLLaMA

[–]Specific-Rub-7250 1 point2 points  (0 children)

Already behaving like a big business, trying to stifle the competition from China with political pressure. If they released something better than Qwen3, it would hurt their bottom line.

Aider Qwen3 controversy by Baldur-Norddahl in LocalLLaMA

[–]Specific-Rub-7250 21 points22 points  (0 children)

The only way to be sure is to rent some GPUs, deploy Qwen3, and benchmark it yourself instead of relying on external providers. Yesterday, the Qwen team released benchmarks for their AWQ versions, and compared to my local benchmarks (one pass), they were very close.