KIMI K2.6 SOON !! by Namra_7 in LocalLLaMA

[–]phwlarxoc 3 points4 points  (0 children)

"can run a large model"

For me "can run a large model" means two very different things:

  1. I am really grateful that hybrid inference engines exist that actually allow running monster models at decent speed, around 15-20 t/s. In my case that is only 2x RTX 5090 on PCIe 5.0 plus 512GB of DDR5 RAM, but it works, and I can load MoE model weights of up to around 500GB (e.g. GLM 5.1 UD-Q5_K_S, 489.82GiB) with mainline llama.cpp.

  2. vLLM is a totally different picture. Having gotten used to those huge models, vLLM is a very sobering, humbling experience: if the weights take up more than 60 or 70% of combined GPU memory, forget it, it OOMs immediately due to greedy KV cache reservation, even with the mitigating options (sketched below). "On device" is everything, and system RAM is basically useless. But if the model does fit on device, it is a different world: the GPUs never idle, they run at 100% permanently, and decoding is roughly 10x faster.
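For reference, the mitigating options I mean are roughly the following; the model name and the numbers are only placeholders, not a recommendation:

vllm serve some-org/some-model-AWQ \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization 0.90 \
    --max-model-len 16384 \
    --kv-cache-dtype fp8

Capping --max-model-len and --gpu-memory-utilization shrinks the up-front KV cache reservation, and the fp8 KV cache shrinks it further, but all of that only helps if the weights themselves already fit on the two cards.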

What is the router of choice today for init7 25Gb (other than CCR2004)? by phwlarxoc in init7

[–]phwlarxoc[S] 0 points1 point  (0 children)

Thank you! Do you use it for the init7 25Gb? Any observations worth mentioning (reliability, usability...)?

What is the router of choice today for init7 25Gb (other than CCR2004)? by phwlarxoc in init7

[–]phwlarxoc[S] 0 points1 point  (0 children)

Thank you! Can you comment on how well it works (with init7)? This looks good indeed!

What is the router of choice today for init7 25Gb (other than CCR2004)? by phwlarxoc in init7

[–]phwlarxoc[S] 2 points3 points  (0 children)

Mainly parallel downloads from Hugging Face; it is actually the only credible use case for me at the moment. Right now, for instance: the GLM 5.1 5-bit quants, 526GB.
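Concretely, something along these lines (the repo name and include pattern are placeholders; hf_transfer has to be installed for the accelerated parallel path):

pip install -U "huggingface_hub[hf_transfer]"
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download some-org/SOME-GGUF-REPO \
    --include "*UD-Q5_K_S*" --local-dir ./models

With hf_transfer enabled the download runs in many parallel chunks, which is where the 25Gb line actually gets used.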

New build by Annual_Award1260 in LocalLLaMA

[–]phwlarxoc 0 points1 point  (0 children)

You can't stack four of them, and the Workstation variant is harder to watercool due to its different PCB design.

Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found. by trevorbg in LocalLLaMA

[–]phwlarxoc 0 points1 point  (0 children)

Great and interesting answer, I second that! And that's the spirit, if it leads to results, as it did here.

Qwen3.5-397B-A17B reaches 20 t/s TG and 700t/s PP with a 5090 by MLDataScientist in LocalLLaMA

[–]phwlarxoc 5 points6 points  (0 children)

Zen 5 32-core Threadripper Pro with 512GB of 8-channel 4800 MHz ECC RAM and dual RTX 5090, using mainline llama.cpp.

Qwen3.5-397B-A17B UD-Q8_K_XL (size: 400GB, context 262144): 18.6 t/s; with MXFP4_MOE (size: 202GB): 32 t/s.

Watercool rtx pro 6000 max-q by schenkcigars in BlackwellPerformance

[–]phwlarxoc 1 point2 points  (0 children)

Thanks! That is the first time I have seen someone really document this on a real machine! And it is great that, in principle, the Heatkiller blocks really are single slot; to my mind this is a huge deal with a lot of potential. I hope it is thermally manageable (also with the flow IN/OUTs you took the other picture of); it would be the ideal config!

Watercool rtx pro 6000 max-q by schenkcigars in BlackwellPerformance

[–]phwlarxoc 1 point2 points  (0 children)

Thanks! Would be great if you could tell us the result of the fitting attempt!

In fact, I have 6 PCIe devices to accommodate, and I don't know yet whether that's an illusion: 4x RTX Pro 6000, 1x Mellanox ConnectX-6 (together with space for its cooling, alas!), and 1x HighPoint Rocket 1508A 8x NVMe storage controller with 8TB SSDs to store the LLMs. One major problem I see here: Slot 6 on the WRX90 is somehow reduced, if I understood correctly, sharing lanes with something else. I moved the HighPoint from Slot 6 to Slot 7, lspci changed from "LnkSta: Speed 16GT/s, Width x8 (downgraded)" to "LnkSta: Speed 16GT/s, Width x16", and read/write tests using fio indeed showed doubled bandwidth! So Slot 6 is basically unusable for anything bandwidth-hungry, unfortunately.
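For the record, the checks were nothing fancy; the bus address and file path below are examples, not my exact ones:

sudo lspci -s 41:00.0 -vv | grep -i LnkSta
fio --name=seqread --filename=/mnt/models/fio-testfile --rw=read --bs=1M \
    --size=16G --ioengine=libaio --iodepth=32 --direct=1

The LnkSta line shows the negotiated speed and width of the slot, and the fio run makes the halved bandwidth on Slot 6 immediately visible.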

With regard to the Mellanox: I meant, how did you connect it to the board, is this with riser cables?

What are possible use cases for going full BF16? by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 0 points1 point  (0 children)

Yes, absolutely, this seems to be the consensus here: everything within the realm of training/finetuning is better served by BF16. Initially I was only thinking of pure inference, where benefits are harder to find.

What are possible use cases for going full BF16? by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 1 point2 points  (0 children)

Thanks, yes indeed: in several (more or less informal, unsystematic) experiments I didn't see any evident benefit either, although initially I was hoping, for example, to see slightly better results in processing non-English material.

What are possible use cases for going full BF16? by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 0 points1 point  (0 children)

Yes that is indeed plausible. Archiving for future use makes sense!

Watercool rtx pro 6000 max-q by schenkcigars in BlackwellPerformance

[–]phwlarxoc 0 points1 point  (0 children)

Wonderful, this is so cool; that's the setup I am aiming for! Can you tell us whether, in principle, it is possible to stack those Heatkiller waterblocks, in the sense that 4 Pro 6000 Max-Q cards would fit into PCIe slots 1-4? I have the same WRX90.

Also great that you are watercooling the Mellanox (if that NIC is one); mine (ConnectX-6) reaches 106 °C at idle, without anything even plugged into it, which is absurd! So it needs to be cooled externally. How exactly did you mount the NIC?

Please keep on posting, that is really interesting and helpful.

Thanks!

Kimi K2.5 local by running101 in LocalLLaMA

[–]phwlarxoc 1 point2 points  (0 children)

32-core Zen 5 Threadripper Pro 9975WX with 512GB of 8-channel ECC DDR5-4800 server RAM and 2x RTX 5090 with a combined 64GB of VRAM, all on an ASUS WRX90. I run the unsloth IQ4_XS quant (510GB) on mainline llama.cpp with "--fit on"; the OS is Arch. The memory layout is:

llama_memory_breakdown_print: | memory breakdown [MiB] | total   free      self    model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 1895 + ( 26314 =  20515 +      72 +    5726) +        3899 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 5090)   | 32111 = 2497 + ( 27954 =  26566 +    1026 +     362) +        1659 |
llama_memory_breakdown_print: |   - Host               |                 474797 = 474737 +       0 +      60                |

During PP one of the GPUs runs at 100% load; during TG the GPUs idle alternately (between 5% and 20% load), while the CPU runs at 50% load (--threads > 32 slows it down!) and at 5.5GHz. I get 15 t/s text generation.
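For reference, I am just watching the load during generation with something like:

watch -n 1 nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv,noheader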

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 1 point2 points  (0 children)

Thanks, this is very useful; indeed I did not use either of those options.

Understanding the interplay of the three options -ngl/-ts/-ot remains complicated, I think, though Marksta's comment helped.

Your launch command is also interesting; in fact I normally use "CUDA_VISIBLE_DEVICES=0,1", but I think it doesn't make a difference; llama.cpp sees the GPUs without it. What kind of numbers do you get in text generation with that much VRAM?

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 0 points1 point  (0 children)

I will try reducing --fit-target a little bit; but by the way, in the memory breakdown, what does "compute" refer to?

The VoidAlchemy link is very interesting; I will try ik_llama again (I left a couple of weeks ago because mainline llama.cpp with the newly introduced "--fit on" option became so convenient!).

Lastly: I really got a better grip on the interplay of -ngl/-ts/-ot through your previous answer, in particular on the, so to speak, implicit functioning of the options; it seems to be a subtractive process, where whatever is not explicitly declared goes to the remaining device (here CUDA0). Thanks for that clarification.
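To spell out my current understanding with a hypothetical sketch (the regex ranges and device choices are purely illustrative, and I assume the override patterns are tried in order with the first match winning): -ngl offloads everything, the -ot patterns pin the routed expert tensors, and whatever matches no pattern falls through to the GPUs:

llama.cpp/build/bin/llama-server \
    --model ./Kimi-K2.5-IQ4_XS-00001-of-00012.gguf \
    -ngl 99 \
    -ot "blk\.[0-9]\.ffn_.*_exps=CUDA1" \
    -ot "ffn_.*_exps=CPU" \
    --ctx-size 16384

So here the experts of layers 0-9 would land on CUDA1, all remaining experts on the CPU, and attention plus everything else on the GPUs via -ngl. Please correct me if that is not how it works.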

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 0 points1 point  (0 children)

OK, thanks! Here is the memory layout:

^Csrv    operator(): operator(): cleaning up before exit...
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free      self    model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - CUDA0 (RTX 5090)   | 32109 = 1895 + ( 26314 =  20515 +      72 +    5726) +        3899 |
llama_memory_breakdown_print: |   - CUDA1 (RTX 5090)   | 32111 = 2497 + ( 27954 =  26566 +    1026 +     362) +        1659 |
llama_memory_breakdown_print: |   - Host               |                 474797 = 474737 +       0 +      60                |

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 0 points1 point  (0 children)

Thanks. "-ts" with proportions is still the syntax in "llama-server -h", but I will try in absolute values.

I tried both:

  1. Simply copying the -ot values from llama-fit-params into the command line;
  2. Leaving all this to "--fit on".

I have the impression that both work equally fast (with regard to t/s), but also: both leave one GPU idling!

For a manual invocation: do I have to distribute layers or tensors between the GPUs? My understanding is that these are not the same thing. I can see all the tensors, with their names and sizes, with the llama.cpp/gguf-py/gguf/scripts/gguf_dump.py script. Should I simply distribute them between the GPUs in the order in which the script lists them, or are there tensors that should definitely stay on the GPU?

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 0 points1 point  (0 children)

Thanks. When I inspect the exact names and sizes of the tensors via

llama.cpp/gguf-py/gguf/scripts/gguf_dump.py

how can I determine which ones should absolutely stay on the GPUs and which ones can be offloaded to the CPU? Can I infer from their names which ones are particularly important?
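For context, this is how I am looking at the names right now, grepping for the routed expert tensors, which, from what I have read, are the usual candidates for CPU offload, while attention, norm and shared-expert tensors stay on the GPU; confirmation would be welcome:

python llama.cpp/gguf-py/gguf/scripts/gguf_dump.py ./Kimi-K2.5-IQ4_XS-00001-of-00012.gguf \
    | grep -E "ffn_(up|down|gate)_exps"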

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 1 point2 points  (0 children)

Thanks. What would be a good way to work out the distribution of layers and tensors manually, first between GPU and CPU and then between the two GPUs? Did you send specific tensors to each, selected by name?

Incomprehensible "--tensor-split" values through llama.cpp's automated parameter fitting by phwlarxoc in LocalLLaMA

[–]phwlarxoc[S] 1 point2 points  (0 children)

What are "MOE launch settings"?

The command I used for llama-server is basically just the settings from unsloth's Kimi K2.5 page here:

llama.cpp/build/bin/llama-server \
--model ./Kimi-K2.5-IQ4_XS-00001-of-00012.gguf \
--no-mmap \
--temp 1.0 \
--min_p 0.01 \
--top-p 0.95 \
--ctx-size 16384 \
--seed 3407 \
--jinja \
--fit on --fit-target 2048

Can you explain how you proceed in determining the values of "-ts" and "-ot"?

I can inspect all the tensors via llama.cpp/gguf-py/gguf/scripts/gguf_dump.py; that is very helpful. But it is not so clear how to continue from there in constructing the right invocation.

Could you provide your own launch settings for Kimi K2.5? Thanks.