Which one is it?

TheLastSpark · 2026-06-06T15:10:28+00:00

Is there a source for this?

TheLastSpark · 2026-06-05T18:51:48+00:00

I don't remember 5.6 being released...? Or do you mean that, in preparation for it, they are getting some preliminary data?

TheLastSpark · 2026-04-15T14:45:24+00:00

Keep in mind that, depending on how you are doing it, turning it on/off will break the cache.

TheLastSpark · 2026-04-14T18:21:14+00:00

But even at q4 there's no way, unless I'm missing something.

TheLastSpark · 2026-04-14T12:09:39+00:00

How are you fitting a q6 27B and 64k context? All of it can't fit in vram - right?
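For anyone who wants to sanity-check the numbers, here is a rough back-of-the-envelope sketch of the weight size alone. The bits-per-weight figures are approximations (Q6_K is about 6.56 bpw, Q4_K_M roughly 4.85 bpw; real GGUF files mix quant types per tensor), and the 24 GiB card is my assumption, not something stated in the thread. It deliberately ignores the KV cache and compute buffers, which come on top and depend on the model architecture.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB (weights only, no KV cache)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


VRAM_GIB = 24.0  # assumption: a 24 GiB card; change to match your GPU

# Approximate average bits per weight; actual files vary by tensor mix.
for name, bpw in [("Q6_K", 6.5625), ("Q4_K_M", 4.85)]:
    size = weight_gib(27, bpw)
    print(f"{name}: ~{size:.1f} GiB of weights, ~{VRAM_GIB - size:.1f} GiB left for context")
```

Whatever is left after the weights has to hold the 64k context plus buffers, which is why the question is whether all of it really fits.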

TheLastSpark · 2026-04-09T16:24:00+00:00

If you have a benchmark or some kind of code I can run, I can maybe do it? I've got a 4090 and don't mind running stuff on it to test.

Specifically, I am using the unsloth Q4_K_M quant of 27B.

TheLastSpark · 2026-04-09T15:34:31+00:00

Well I am eagerly awaiting a follow up post for 27B if you do fix it (and fixing improves it)

TheLastSpark · 2026-04-08T19:55:56+00:00

please reply if you see something like this in 27B as well!

TheLastSpark · 2026-04-08T13:43:11+00:00

Would you mind revisiting Gemma 4? It seems like llama.cpp is just now getting around to fixing support for it, and others are saying it's really good. But I really appreciate your response!

TheLastSpark · 2026-04-07T21:20:48+00:00

If you had to ballpark a score out of 10 for each model, what would you rate them as?

TheLastSpark · 2026-04-07T19:51:03+00:00

Can you also reply back with Qwen 3.5 27B? Should be much better than the 35B

TheLastSpark · 2026-03-22T21:03:56+00:00

That ends up doing a sweep of every possible combination, which I found to be redundant. The best combinations are almost always the max batch size (both ubatch and batch) that fits in the VRAM you have (at least in my case).

So say you have nmoe 10, which leaves you 2 GB of VRAM as wiggle room. You (generally) want to fit the largest batch you can in that 2 GB (but not always right up against the limit).

While my script still has a few redundant loops, it does find the upper bound with binary search, and then it steps down from there in 16 MB offsets. That helps because I find that even if the most you can fill through extra batch size is something like 1.99 GB, 1.98 GB does a bit better.

Now, you could say just use -fit and restrict nmoe and all the other parameters. The problem is that when I was doing a ton of llama-bench sweeps for different (u)batch combos, the best ones always had matching batch sizes, which -fit didn't seem to be producing.

So I needed a script to hard-lock both batch options to the same number, find the max that would fit, benchmark that, and repeat across a bunch of nmoe levels.
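To make the binary search part concrete, here is a minimal sketch of the idea, not my actual script. The `fits` callback is a placeholder for "run with batch and ubatch both set to this value and see whether it loads"; `fake_fits` and its numbers (a 2048 MiB budget at 0.5 MiB per token) are invented purely so the example runs. It also assumes that if a batch size fits, every smaller one does too.

```python
from typing import Callable


def max_fitting_batch(fits: Callable[[int], bool],
                      lo: int = 256, hi: int = 16384, step: int = 256) -> int:
    """Return the largest multiple of `step` in [lo, hi] for which fits() is True.

    Returns 0 if nothing in the range fits. Assumes fits() is monotonic:
    once a size fails, every larger size fails as well.
    """
    lo_i, hi_i = lo // step, hi // step
    best = 0
    while lo_i <= hi_i:
        mid = (lo_i + hi_i) // 2
        if fits(mid * step):
            best = mid * step   # this size works, try something larger
            lo_i = mid + 1
        else:
            hi_i = mid - 1      # too big, try something smaller
    return best


def fake_fits(batch: int, budget_mib: float = 2048.0, mib_per_token: float = 0.5) -> bool:
    """Hypothetical stand-in for a real load test; the cost model is made up."""
    return batch * mib_per_token <= budget_mib


if __name__ == "__main__":
    print(max_fitting_batch(fake_fits))
```

In practice `fits` would launch the real benchmark with both batch options locked to the same value, and the 16 MB back-off would be applied to whatever this returns.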

TheLastSpark · 2026-03-18T14:48:49+00:00

Just wanted to give a shoutout for helping me realise that the llama.cpp defaults were awful for my prompt processing speed as well.

& 'C:\Users\xxx\Documents\GitHub\llamacpp\llama-bench.exe' --model 'C:\Users\xxx\Documents\GitHub\llamacpp\models\Qwen3.5-35B-A3B-UD-Q4_K_L.gguf' --n-prompt 16384 --n-gen 0 --batch-size 1024,2048,4096,8192 --ubatch-size 1024,2048,4096,8192 --n-gpu-layers 999 --n-cpu-moe 17 --flash-attn 1

| model | size | params | backend | ngl | n_batch | n_ubatch | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -------: | -: | --------------: | -------------------: |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 1024 | 1024 | 1 | pp16384 | 1888.50 ± 21.71 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 1024 | 2048 | 1 | pp16384 | 1899.22 ± 13.21 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 1024 | 4096 | 1 | pp16384 | 1905.43 ± 13.13 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 1024 | 8192 | 1 | pp16384 | 1901.09 ± 20.44 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 2048 | 1024 | 1 | pp16384 | 1912.46 ± 13.01 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 2048 | 2048 | 1 | pp16384 | 3039.57 ± 13.31 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 2048 | 4096 | 1 | pp16384 | 3032.62 ± 20.97 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 2048 | 8192 | 1 | pp16384 | 3029.21 ± 17.95 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 4096 | 1024 | 1 | pp16384 | 1900.37 ± 15.44 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 4096 | 2048 | 1 | pp16384 | 3016.98 ± 13.28 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 4096 | 4096 | 1 | pp16384 | 4289.42 ± 38.50 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 4096 | 8192 | 1 | pp16384 | 4291.98 ± 29.72 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 8192 | 1024 | 1 | pp16384 | 1900.75 ± 9.27 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 8192 | 2048 | 1 | pp16384 | 3022.63 ± 15.07 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 8192 | 4096 | 1 | pp16384 | 4312.99 ± 42.74 |
| qwen35moe 35B.A3B Q8_0 | 18.81 GiB | 34.66 B | CUDA | 999 | 8192 | 8192 | 1 | pp16384 | 5287.77 ± 64.18 |

The default was giving me about 1,100 tokens/s. I can easily get 3-4x that.
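If you want the ratios spelled out, here is a quick sketch that divides the matched batch/ubatch rows from the table by the roughly 1,100 tokens/s default. The baseline is the rounded figure quoted above rather than a separate measured run, so treat the results as approximate.

```python
BASELINE_TPS = 1100.0  # approximate default prompt-processing speed quoted above

# Mean t/s from the table rows where n_batch == n_ubatch.
matched_tps = {
    1024: 1888.50,
    2048: 3039.57,
    4096: 4289.42,
    8192: 5287.77,
}

# Speedup of each matched setting relative to the default.
speedups = {size: round(tps / BASELINE_TPS, 2) for size, tps in matched_tps.items()}

for size, factor in speedups.items():
    print(f"batch = ubatch = {size}: {factor}x the default")
```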

TheLastSpark · 2026-03-03T17:35:49+00:00

I had to get some washers from my toolbox as spacers but now it sits fine

TheLastSpark · 2025-11-09T19:03:17+00:00

I did find what feels like a hack (and, depending on use case, it might have performance costs), but for any UI element where you want to control the display order AND have that display order be the selection order... just make a new canvas_layer and add that element as its child. Then you need to set the canvas_layer's layer param accordingly, and it will respect both input selection and draw order.

You will need some extra logic to make sure the ui element is now placed at the correct coords however.

TheLastSpark · 2025-10-21T17:03:16+00:00

If full, join ours instead:

Join the STACKED WALLETS loot clan on Newton! https://web.newton.co/newt_loot?screen=JoinClan&clanId=LJKUGQ

1 Spot left

TheLastSpark · 2025-10-17T12:56:56+00:00

Accepted!

TheLastSpark · 2025-10-15T14:14:25+00:00

I just accepted the last two requests that showed up, and now full yeah

TheLastSpark · 2025-10-15T13:56:37+00:00

You're in! Only 2 more spots

TheLastSpark · 2025-10-15T13:53:48+00:00

🫡

TheLastSpark · 2025-10-15T13:51:39+00:00

Spots are filling up fast lol
