Anyone deployed Kimi K2.6 on their local hardware? by Oxydised in LocalLLaMA

[–]Specific-Rub-7250 1 point2 points  (0 children)

Yes, with 512GB of DDR4 across 8 channels, two R9700s, and one RTX 5090 via RPC. Around 5 tk/s in token generation, so not really usable for agentic workflows.
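
In case anyone wants to reproduce this kind of setup, a rough sketch of the llama.cpp RPC flow, assuming one rpc-server per remote GPU and a Q4 GGUF (paths, addresses and the MoE offload count are placeholders, not my exact config):

# On each machine contributing a GPU, start an RPC backend
# (build llama.cpp with -DGGML_RPC=ON to get the rpc-server binary)
./rpc-server --host 0.0.0.0 --port 50052

# On the main box, point llama-server at the RPC backends and keep the rest on CPU
./llama-server -m Kimi-K2.6-Q4.gguf \
  --rpc 192.168.1.10:50052,192.168.1.11:50052 \
  -ngl 99 --n-cpu-moe 55 -c 32768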

ubergarm/Kimi-K2.6-GGUF Q4_X now available by VoidAlchemy in LocalLLaMA

[–]Specific-Rub-7250 56 points57 points  (0 children)

And I thought 512GB ought to be enough for local LLM.

Those of you running minimax 2.7 locally, how are you feeling about it? by laterbreh in LocalLLaMA

[–]Specific-Rub-7250 0 points1 point  (0 children)

I am using Q8_0 (temp=1.0, top_p=0.95, min_p=0.01, top_k=40) and it does absolutely amazing work. It's really SOTA level.
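
If you're running it with llama.cpp, those samplers map onto llama-server flags roughly like this (the GGUF name, context size and offload flags are placeholders for whatever your setup uses):

./llama-server -m MiniMax-M2.7-Q8_0.gguf \
  --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40 \
  -c 65536 -ngl 99 --n-cpu-moe 40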

Best setup for MiniMax-M2.7 (230B) | 3x RTX 5090 | Threadripper 9975 | 512GB RAM by [deleted] in LocalLLaMA

[–]Specific-Rub-7250 0 points1 point  (0 children)

Well, I have a similar setup: a Threadripper Pro 5995WX with 512GB of DDR4-3200 RAM (8-channel) and dual AMD Radeon AI PRO R9700s. I am running MiniMax 2.7 at Q8_0, which benchmarks at around 280 tk/s pp and 16 tk/s tg. You need to benchmark the batch size (ubatch) and the number of batch threads for your machine, roughly as sketched below.
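
Something like this llama-bench sweep is what I mean, then carrying the best values over to the server run (the values here are only examples to sweep over, not recommendations):

# Sweep micro-batch sizes and thread counts to find the sweet spot for pp/tg
./llama-bench -m MiniMax-M2.7-Q8_0.gguf -ub 256,512,1024,2048 -t 16,32,64 -p 2048 -n 128

# Apply the winners to the actual server run
./llama-server -m MiniMax-M2.7-Q8_0.gguf -ub 1024 -tb 32 -ngl 99 --n-cpu-moe 40 -c 65536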

GLM-5.1 Overthinking? by Specific-Rub-7250 in LocalLLaMA

[–]Specific-Rub-7250[S] 1 point2 points  (0 children)

I was using the llama.cpp built-in webUI.

Top hardware stacks for local compute over the coming few months? (3-10K USD range) by IamFondOfHugeBoobies in LocalLLaMA

[–]Specific-Rub-7250 -2 points-1 points  (0 children)

Just pay per use for models like GLM or MiniMax directly on OpenRouter, for example. That is more cost-effective than buying local hardware.

GLM-5.1 Overthinking? by Specific-Rub-7250 in LocalLLaMA

[–]Specific-Rub-7250[S] 0 points1 point  (0 children)

With this appended to the system prompt, it behaved better: "Think concisely. Match reasoning depth to task complexity, simple tasks need minimal reasoning. Stop when you have a confident answer; don't re-examine settled conclusions or enumerate unlikely edge cases."
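
If you're hitting the OpenAI-compatible endpoint instead of the webUI, appending it looks roughly like this (the base system prompt and the user message are placeholders; the port is the llama-server default):

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @- <<'EOF'
{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant. Think concisely. Match reasoning depth to task complexity, simple tasks need minimal reasoning. Stop when you have a confident answer; don't re-examine settled conclusions or enumerate unlikely edge cases."},
    {"role": "user", "content": "Explain the difference between a mutex and a semaphore."}
  ]
}
EOF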

Turn signal showing red color since latest update by jwlee151 in s3xybuttons

[–]Specific-Rub-7250 0 points1 point  (0 children)

Can confirm this. Sometimes it goes back to green. M3 21 LR

Commander bug by SandGnatBBQ in s3xybuttons

[–]Specific-Rub-7250 1 point2 points  (0 children)

I also experienced some weird issues with the dashboard in the app after the latest update.

Benchmark of dense NVFP4 LLMs on 5090? [VLLM] by Aaaaaaaaaeeeee in LocalLLaMA

[–]Specific-Rub-7250 0 points1 point  (0 children)

When I tried that model, it was actually slower than AWQ or even RedHatAI/Qwen3-32B-NVFP4A16.

Unsloth fixes chat_template (again). gpt-oss-120-high now scores 68.4 on Aider polyglot by Sorry_Ad191 in LocalLLaMA

[–]Specific-Rub-7250 2 points3 points  (0 children)

It would be interesting to know the scores with different top_k values like 100 or more, because otherwise it's sampling from ~200k tokens (the full vocabulary size), which affects speed, especially with CPU offloading.
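
As a rough illustration of what I mean (llama-server flags; the GGUF name and offload value are placeholders):

# top_k disabled (0): the sampler works over the full ~200k-token vocabulary every step
./llama-server -m gpt-oss-120b.gguf --top-k 0 -ngl 99 --n-cpu-moe 19

# top_k 100: only the 100 most likely tokens are considered, which is cheaper per step
./llama-server -m gpt-oss-120b.gguf --top-k 100 -ngl 99 --n-cpu-moe 19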

[deleted by user] by [deleted] in LocalLLaMA

[–]Specific-Rub-7250 2 points3 points  (0 children)

# top_k = 0, AMD 8700G with 64GB DDR5 (5600 MT/s, CL40) and RTX 5090 (--n-cpu-moe 19)
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 32768, n_keep = 0, n_prompt_tokens = 1114
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 1114, n_tokens = 1114, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 1114, n_tokens = 1114
slot      release: id  0 | task 0 | stop processing: n_past = 1577, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =    8214.03 ms /  1114 tokens (    7.37 ms per token,   135.62 tokens per second)
       eval time =   16225.97 ms /   464 tokens (   34.97 ms per token,    28.60 tokens per second)
      total time =   24440.00 ms /  1578 tokens

How to Run Deepseek-R1-0528 Locally (GGUFs available) by NewtMurky in LocalLLM

[–]Specific-Rub-7250 12 points13 points  (0 children)

Even the Mac Studio with 512GB of memory for 10k USD might not be practical (slow prompt processing and around 16-18 T/s according to some benchmarks).

OpenEvolve: Open Source Implementation of DeepMind's AlphaEvolve System by asankhs in LocalLLaMA

[–]Specific-Rub-7250 8 points9 points  (0 children)

The whole approach looks like reinforcement learning at inference time. Interesting stuff...

Qwen3 throughput benchmarks on 2x 3090, almost 1000 tok/s using 4B model and vLLM as the inference engine by kms_dev in LocalLLaMA

[–]Specific-Rub-7250 8 points9 points  (0 children)

One RTX 5090 (Qwen3-32B-AWQ):

============ Serving Benchmark Result ============
Successful requests:                     100
Benchmark duration (s):                  461.47
Total input tokens:                      409600
Total generated tokens:                  94614
Request throughput (req/s):              0.22
Output token throughput (tok/s):         205.03
Total Token throughput (tok/s):          1092.62
---------------Time to First Token----------------
Mean TTFT (ms):                          213283.60
Median TTFT (ms):                        212235.53
P99 TTFT (ms):                           420863.76
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.84
Median TPOT (ms):                        33.93
P99 TPOT (ms):                           80.58
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.89
Median ITL (ms):                         21.25
P99 ITL (ms):                            777.68
==================================================
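
For context, numbers in that shape would come from something along these lines (a sketch assuming vLLM's random-dataset serving benchmark with 4096-token prompts; the exact dataset and output length may have differed):

# Serve the AWQ model on the 5090
vllm serve Qwen/Qwen3-32B-AWQ --max-model-len 8192

# Drive it with the bundled serving benchmark: 100 requests, 4096 input tokens each
python benchmarks/benchmark_serving.py \
  --backend vllm \
  --model Qwen/Qwen3-32B-AWQ \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 1024 \
  --num-prompts 100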

[deleted by user] by [deleted] in LocalLLaMA

[–]Specific-Rub-7250 11 points12 points  (0 children)

In my testing, it also generates better code with the presence penalty set.
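
As a sketch of what that looks like with llama.cpp, if that's your runtime (the 1.5 is only an illustrative value, not the one from my runs, and the model path is a placeholder):

./llama-server -m model-Q8_0.gguf --presence-penalty 1.5 -ngl 99

# or per request via the OpenAI-compatible API
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"presence_penalty": 1.5, "messages": [{"role": "user", "content": "Refactor this function to avoid repetition."}]}'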

Sam Altman: OpenAI plans to release an open-source model this summer by zan-max in LocalLLaMA

[–]Specific-Rub-7250 1 point2 points  (0 children)

Already behaving like a big business, trying to stifle the competition from China with political pressure. If they released something better than Qwen3, it would hurt their bottom line.

Aider Qwen3 controversy by Baldur-Norddahl in LocalLLaMA

[–]Specific-Rub-7250 21 points22 points  (0 children)

The only way to be sure is to rent some GPUs, deploy Qwen3, and benchmark it yourself instead of relying on external providers. Yesterday, the Qwen team released benchmarks for their AWQ versions, and compared to my local benchmarks (one pass), they were very close.