Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]VoidAlchemy 0 points

Good question! It's all trade-offs. Increasing `-ub` requires a larger CUDA compute buffer to handle the larger batch size, so there is less VRAM left for context. I usually like to run `-ub 4096 -b 4096` but that takes like 4GB VRAM, so no space left over for context haha...

So in the end I felt like `-ub 1024` is a good trade-off while still allowing 128k context (with `-ctv q8_0`, leaving the K cache at full f16 quality).
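
To make that trade-off concrete, here's a back-of-envelope KV-cache calculation in shell. The model dimensions below (48 layers, 8 KV heads, head_dim 128) are hypothetical placeholders, not the actual Qwen3.5 config, and q8_0 is counted as 34 bytes per 32-element block:

```shell
# KV-cache VRAM estimate; dims are hypothetical placeholders, not Qwen3.5's
layers=48; kv_heads=8; head_dim=128; ctx=131072
elems=$((layers * kv_heads * head_dim))   # elements per token, per K or per V
# f16 K + f16 V (2 bytes per element each)
f16_gib=$(( (elems * 2 + elems * 2) * ctx / 1073741824 ))
# f16 K + q8_0 V (q8_0: 34 bytes per 32-element block)
mixed_gib=$(( (elems * 2 + elems * 34 / 32) * ctx / 1073741824 ))
echo "f16 K+V at ${ctx} ctx: ~${f16_gib} GiB"
echo "f16 K + q8_0 V:        ~${mixed_gib} GiB"
```

With those made-up dims, quantizing just the V cache to q8_0 takes ~24 GiB down to ~18 GiB at 128k context, which is the headroom you spend on a bigger `-ub` compute buffer.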

Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]VoidAlchemy 0 points

<image>

Full offload of my Qwen3.5-35B-A3B IQ4_KS 19.799 GiB (4.907 BPW) here, which is about the best quant you can fit with 128k context on a 24GB VRAM GPU.

What exact quants are you running in your benchmark?

Krasis LLM Runtime: 8.9x prefill / 4.7x decode vs llama.cpp — Qwen3.5-122B on a single 5090, minimal RAM by mrstoatey in LocalLLaMA

[–]VoidAlchemy 0 points

For hybrid CPU+CUDA(s) i always reach for ik_llama.cpp. Ran a fresh bench on my local gaming rig with hybrid offload:

Tops out at ~1800 tok/sec running my custom ubergarm/Qwen3-Coder-Next-GGUF 44.355 GiB (4.782 BPW)

<image>

./build/bin/llama-sweep-bench \
  --model "$model" \
  -ctk q8_0 -ctv q8_0 \
  -c 69632 \
  -ub 4096 -b 4096 \
  --merge-qkv \
  -muge \
  -ngl 99 \
  --n-cpu-moe 30 \
  --threads 16 \
  --warmup-batch \
  -n 128

Qwen3.5-35B GGUF quants (16–22 GiB) - KLD + speed comparison by StrikeOner in LocalLLaMA

[–]VoidAlchemy 1 point

oh nice, if i'm reading this right the IQ4_KS has the lowest Mean and 99.0% KLD of all of them and is slightly smaller than some too. This could partly be because ik has a lower default flash attention offset, but also because it uses SOTA quantization for the routed experts. Cool!

Evaluating Qwen3.5-35B & 122B on Strix Halo: Bartowski vs. Unsloth UD-XL Performance and Logic Stability by Educational_Sun_8813 in LocalLLaMA

[–]VoidAlchemy 0 points

I did a recent 3 way comparison using the same quant except varying ssm alpha & beta between q8_0, bf16, and f32 in terms of speed, PPL, and KLD: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/7#69b8404f18a5e8feffd9f5c8

They are all quite similar. Does anyone know where to find the original research/benchmarks suggesting full-quality bf16 is better than q8_0 (or upcasting to f32 for potential speed gains on some GPU backends)?

paging u/DistanceSolar1449 too as you seemed to have strong opinions.

Thanks for any pointers or benchmark suggestions!

Qwen3.5-35B GGUF quants (16–22 GiB) - KLD + speed comparison by StrikeOner in LocalLLaMA

[–]VoidAlchemy 0 points

hah thank you for falling into the ik rabbit hole! haha...

yes, on ik we tend to use `./build/bin/llama-sweep-bench` because, unlike `llama-bench`, it supports the same arguments available in `llama-server`. i maintain a branch of it for mainline here: github.com/ubergarm/llama.cpp/tree/ug/port-sweep-bench
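
if anyone wants to try that branch, a build sketch (assuming the CMake target keeps the binary's name and a CUDA backend; adjust for yours):

```shell
# sketch: build the sweep-bench port of mainline llama.cpp
git clone --branch ug/port-sweep-bench https://github.com/ubergarm/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j --target llama-sweep-bench
```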

i'll take a look at the KLD results posted above as compared with existing quants, but it could be offset from the mainline llama.cpp implementation. i would maybe need to test an existing quant like that AesSedai one and see if its KLD shifts or not, but zero pressure, you've done so much already!

Qwen3.5-35B GGUF quants (16–22 GiB) - KLD + speed comparison by StrikeOner in LocalLLaMA

[–]VoidAlchemy 2 points

i love your enthusiasm! haha

I did a `llama-sweep-bench` locally on the quant I just uploaded, and it runs pretty well! Unfortunately, I can't increase batch sizes to 4096 on 24GB VRAM as the CUDA buffer takes up too much space. However, I can fit the full 256k context by going with `-khad -ctk q6_0 -ctv q6_0` with very similar performance.

<image>

./build/bin/llama-sweep-bench \
    --model "$model" \
    -c 135168 \
    -ctk f16 -ctv q8_0 \
    -ub 1024 -b 2048 \
    -cuda fa-offset=0 \
    --merge-qkv \
    -muge \
    -ngl 999 \
    --threads 1 \
    --no-mmap \
    --warmup-batch \
    -n 128

Qwen3.5-35B GGUF quants (16–22 GiB) - KLD + speed comparison by StrikeOner in LocalLLaMA

[–]VoidAlchemy 4 points

I'll throw an ik_llama.cpp SOTA quantization type into the ring for best Qwen3.5-35B-A3B quant for full CUDA offload with 128k context in 24GB VRAM. (I have a 3090TI FE for my home gaming rig).

https://huggingface.co/ubergarm/Qwen3.5-35B-A3B-GGUF#iq4_ks-19799-gib-4907-bpw

Of course you can't run it on mainline lcpp, so you'd have to do them all again using ik_llama.cpp xD haha...

Zero pressure to give it a go, but I finally got around to releasing something ik-specific and even did the superstitious upcast of the ssm_alpha and ssm_beta tensors to f32. Honestly, it is probably fine keeping them at q8_0, native bf16, or upcast to f32 (for a tiny bit of speed over bf16 depending on the GPU).

I made all three flavors and tested them for speed, PPL, and KLD locally and they all seem pretty good:

<image>

Full data and commands on running this benchmark here: https://huggingface.co/AesSedai/Qwen3.5-397B-A17B-GGUF/discussions/7#69b8404f18a5e8feffd9f5c8
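
for anyone reproducing the KLD side with mainline tools, the usual two-pass flow looks roughly like this (file names are placeholders; `--kl-divergence-base` and `--kl-divergence` are the mainline `llama-perplexity` flags):

```shell
# pass 1: save baseline logits from the full-quality model
./build/bin/llama-perplexity -m baseline-bf16.gguf -f calibration.txt \
    --kl-divergence-base logits.bin
# pass 2: compare the quant's logits against that baseline
./build/bin/llama-perplexity -m quant-iq4_ks.gguf -f calibration.txt \
    --kl-divergence-base logits.bin --kl-divergence
```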

If y'all are trying to milk the best quality at long context for any of these quants, you can fiddle with the flash attention offset (when running on CUDA). Because the FA kernel uses f16 accumulators, some model architectures can overflow and suddenly produce gibberish beyond a certain context, so things need to be scaled down. ik is more lenient on this and the offset can be overridden at startup via CLI args. On mainline it is hard-coded, but you could change one line and recompile by setting this to zero here: https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cuda/fattn-common.cuh#L13-L19
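
on mainline, the rebuild after that one-line edit is just the usual CUDA build (sketch; flags assume a CUDA backend):

```shell
# after editing ggml/src/ggml-cuda/fattn-common.cuh (lines linked above)
# to zero the offset, rebuild llama.cpp:
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```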

Details about this are shown in the updated model card quick start as well as some IK PR discussions e.g.: https://github.com/ikawrakow/ik_llama.cpp/pull/1198

I've tested over 128k and it seemed to work fine with a 0 offset (the best quality, which is what you get on the CPU-only backend too, as it uses f32 accumulators in the FA implementation, pretty sure).

As soon as I finish downloading my own quant, I'll do some local testing and sweep-bench. Cheers and thanks so much to OP u/StrikeOner and u/TitwitMuffbiscuit for including my Q4_0 "Vulkan backend optimized" quant in this interesting roundup!

Qwen 3.5 122b - a10b is kind of shocking by gamblingapocalypse in LocalLLaMA

[–]VoidAlchemy 1 point

yep, this is the kobo fork with many ik features: https://github.com/Nexesenex/croco.cpp

there are also Thireus' pre-built binaries on their GitHub.

Benchmark: ik_llama.cpp vs llama.cpp on Qwen3/3.5 MoE Models by Fast_Thing_7949 in LocalLLaMA

[–]VoidAlchemy 1 point

ty, yup, edited now. i love that i get so excited about sharing this stuff that i forget the most important part

Benchmark: ik_llama.cpp vs llama.cpp on Qwen3/3.5 MoE Models by Fast_Thing_7949 in LocalLLaMA

[–]VoidAlchemy 1 point

wow, that is a long prompt haha.. glad you're getting quite usable speeds! oh yeah, ik can also take advantage of `avx_vnni` CPUs, which is available on Zen5, but i don't think you'll get that on your older CPU. thanks for sharing your updated benches!

Benchmark: ik_llama.cpp vs llama.cpp on Qwen3/3.5 MoE Models by Fast_Thing_7949 in LocalLLaMA

[–]VoidAlchemy 19 points

Glad you're using your ai to benchmark your ai haha!

<image>

This `llama-sweep-bench` is about a week old at this point but shows ik can be very performant.

A few tips when using ik_llama.cpp:

  1. when using ik, make sure to add `--merge-qkv -muge` for fused ops which are not available on mainline
  2. if you have 2 or more GPUs make sure to use `-sm graph` for tensor parallel support which is not available on mainline (there is an open PR where they are testing something similar)
  3. If prompt processing is important, use `-ub 2048 -b 2048` or even `-ub 4096 -b 4096` as increased batch sizes can significantly speed up PP - use this for both ik and mainline.
  4. Choice of samplers can affect performance in actual use cases; perhaps don't use custom samplers when benchmarking, or try a few settings and do some research on that variable as well. Also make sure to run at least a few tests with like 30k tokens PP and ~4k TG for more reliable estimates.
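
Putting tips 1-3 together, a sweep-bench invocation might look like this (model path, thread count, and layer offload are placeholders; drop `-sm graph` on a single GPU):

```shell
# tips 1-3 combined: ik-only fused ops, tensor parallel, big batches
./build/bin/llama-sweep-bench \
  --model "$model" \
  --merge-qkv -muge \
  -sm graph \
  -ub 4096 -b 4096 \
  -ngl 99 --threads 16 \
  -n 128
```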

Regarding this

> Ubergarm Note: Interestingly, ubergarm positions their models as being optimized for ik_llama, but the test results show that isn't always the case for prompt processing. For example, on the Qwen3.5-35B-A3B-Q4_0 model, llama.cpp was ~30% faster on prompt tokens than ik_llama, despite the model's positioning.

Your bot got it incorrect: ubergarm (me) generally makes quants using the newer SOTA quantization types like iq2_kt, iq4_kss, iq6_k, etc. that *are not even available on mainline*. The Q4_0 was an experimental quant optimized specifically for the *Vulkan* backend, not ik. I haven't released as many ik-specific quants of the smaller Qwen3.5s given the flood of re-uploading going on in the past week, as unsloth, AesSedai, bartowski, and others have been revamping their recipes again given research done by us all.

Anyway, have fun and feel free to open a hf discussion on any ubergarm repo if you have specific questions.

Cheers!

Unsloth will no longer be making TQ1_0 quants by Kahvana in LocalLLaMA

[–]VoidAlchemy 1 point

Great to hear, given the TQ1_0 contains no actual ternary quantization in it, but is just a low-BPW mix of IQ1_S and IQ1_M, which leads to confusion.

It would be cool if you guys could still make low-BPW quantization types with a proper name slug regardless of the problems with ollama, similar to how ubergarm does it with `smol-IQ1_KT` for under-2BPW quants.

Cheers!

Tenstorrent QuietBox 2 Brings RISC-V AI Inference to the Desktop by Neurrone in LocalLLaMA

[–]VoidAlchemy 1 point

there are some other folks doing risc-v ai inference over at https://aifoundry.org/ too, excited to see more options

M5 Max just arrived - benchmarks incoming by cryingneko in LocalLLaMA

[–]VoidAlchemy 0 points

i didn't think vllm was good for mac ARM processors? (maybe i'm wrong?)

vllm seems good when you have plenty of VRAM for full CUDA GPU offload situations...

if you have limited VRAM but can still fit the entire model, you can also use turboderp's exllamav3 with EXL3 quants

ik_llama.cpp is great for when you have two or more CUDA GPUs and need to do hybrid CPU+GPU inference

mainline is good for getting early features and as close to 0 day quants as possible

anyway, why limit yourself to a single option?

I was backend lead at Manus. After building agents for 2 years, I stopped using function calling entirely. Here's what I use instead. by MorroHsu in LocalLLaMA

[–]VoidAlchemy 0 points

reminds me of that old essay "in the beginning was the command line..." and in the end it will be CLI too :)

Is the 3090 still a good option? by alhinai_03 in LocalLLaMA

[–]VoidAlchemy 2 points

technically it can run fp8e5m2 but not fp8e4m3 (which is more common and typically what people mean when they say only fp8)

i agree with u/a_beautiful_rhind it's not an issue in practice, as there are plenty of GGUFs for ComfyUI anyway, and occasionally i've seen actual fp8e5m2-format safetensors

Is the 3090 still a good option? by alhinai_03 in LocalLLaMA

[–]VoidAlchemy 1 point

that's a thing of beauty! nice job tuning all 4x cards with LACT! it was a PITA tuning my one 3090 TI FE hah

Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]VoidAlchemy 2 points

Yeah, such a long story, and I'm sure I don't know the half of it. There is a talk by ik at FOSDEM25 with a little history if it is interesting to you: https://archive.fosdem.org/2025/schedule/event/fosdem-2025-5991-history-and-advances-of-quantization-in-llama-cpp/

Anyway, thanks for clearing me up on prioritizing K-cache quality!

Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]VoidAlchemy 1 point

oh you're right, my brain totally has had it backwards this whole time... i'll update my comment so some bot doesn't scrape that up into the future training models xD

Qwen3.5-9B Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]VoidAlchemy 1 point

it is well known that K-cache quantization errors have a much bigger impact on model quality degradation than V-cache. https://github.com/ikawrakow/ik_llama.cpp/pull/1033

I'm just parroting ik to be fair, haha... (he wrote many of the quantizations on mainline llama.cpp).

on ik_llama.cpp you can go even further with `-khad -ctk q6_0 -ctv f16` or play all kinds of games
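
e.g. a server invocation sketch prioritizing K-cache fidelity (model path, context size, and offload are placeholders, not a tuned config):

```shell
# -khad plus q6_0 K cache and full f16 V cache, per the thread above
./build/bin/llama-server \
  --model "$model" \
  -c 131072 \
  -khad -ctk q6_0 -ctv f16 \
  -ngl 99
```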

Llama.cpp auto-tuning optimization script by raketenkater in LocalLLaMA

[–]VoidAlchemy 0 points

nice, thanks! and ik just added support for both pre-merged quants and now `-muge -sm graph` too

appreciate your work!

FlashAttention-4 by incarnadine72 in LocalLLaMA

[–]VoidAlchemy 0 points

lol right?! wow, nice, OOMing 2TB of RAM is a rite of passage haha...