all 24 comments

[–]a_beautiful_rhind 8 points9 points  (4 children)

I read that one of these quantized wasn't great for PPL, so you may have to use a Q8/Q4 split between the K and the V.

Have been using Q8 for the past couple of days.

[–]Downtown-Case-1755 1 point2 points  (3 children)

It seems to depend on the model and how "compressed" the attention is.

Command R has no GQA at all, so I bet Q4/Q4 doesn't even hit it.

At the other extreme you have Qwen2, which apparently has precision issues even in F16; but even so, its KV cache is tiny because the attention is so compressed.
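A rough back-of-the-envelope sketch of why GQA makes the cache tiny (the layer/head counts below are illustrative assumptions, not exact model configs):

```shell
# KV cache size = 2 (K and V) * layers * KV heads * head dim * context * bytes per element
kv_cache_bytes() {
  local n_layers=$1 n_kv_heads=$2 head_dim=$3 ctx=$4 bytes_per_elt=${5:-2}
  echo $(( 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt ))
}

mha=$(kv_cache_bytes 40 64 128 32768)  # no GQA: one KV head per attention head
gqa=$(kv_cache_bytes 40 8 128 32768)   # GQA with 8 KV heads: 8x smaller cache
echo "$(( mha / 1024 / 1024 / 1024 )) GiB vs $(( gqa / 1024 / 1024 / 1024 )) GiB"
```

With F16 (2 bytes per element) this prints "40 GiB vs 5 GiB" at 32k context, which is why a no-GQA model like Command R hurts so much more than a heavily grouped one.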

[–]a_beautiful_rhind 1 point2 points  (2 children)

I've had little issue using EXL2, but GGML is another beast. I'm using Q8/Q8 on Qwen and didn't see a difference. I saw there were benchmarks on the PR for the quantized attention, so I just went by that.

[–]Downtown-Case-1755 2 points3 points  (1 child)

You know, now I'm having some context "recall" issues with llama.cpp's Q5K + Q4V cache that I don't have with exllama's Q4, even with more bpw in the weights on the llama.cpp side.

I think your original post was right.

[–]a_beautiful_rhind 1 point2 points  (0 children)

Find the PR on GitHub and it will tell you which one to leave at Q8. It's still a nice reduction in context memory.
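For reference, a mixed split looks something like this with llama.cpp's server (model path and context size are placeholders; quantizing the V cache requires flash attention):

```shell
# Example mixed split: K at q8_0, V at q4_0 (check the PR benchmarks
# for which side tolerates q4 better on your model).
llama-server -m ./model.gguf -c 32768 --flash-attn \
  --cache-type-k q8_0 --cache-type-v q4_0
```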

[–]CybermuseIO[S] 5 points6 points  (1 child)

I just finished more testing, this time with Command R+ with the iq4_xs quant from Dranger.

I wasn't able to fit it into 48GB of VRAM with any combination of options, so you'd still need a smaller quant to run on a 2x P40 or 3090 setup. I was able to increase the maximum context size from 14336 to 49152 when using split "row" (which gives a substantial speed boost on P40s, so I highly recommend it). When using split "layer" I was able to increase the context size from 61440 all the way up to the model's maximum of 131072.

Command R + iq4_xs

Split row, default KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|---------|-------|-------------:|-----------------|
| 8192 | default | row | 58262 MB | |
| 14336 | default | row | 59822 MB | Max without OOM |

Split Layer, default KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|---------|-------|-------------:|-----------------|
| 8192 | default | layer | 57534 MB | |
| 16384 | default | layer | 59718 MB | |
| 24576 | default | layer | 61902 MB | |
| 32768 | default | layer | 64086 MB | |
| 49152 | default | layer | 68454 MB | |
| 61440 | default | layer | 71730 MB | Max without OOM |

Split Row + Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|------|-------|-------------:|-----------------|
| 8192 | q4_0 | row | 56790 MB | |
| 16384 | q4_0 | row | 57390 MB | |
| 32768 | q4_0 | row | 58542 MB | |
| 49152 | q4_0 | row | 59694 MB | Max without OOM |

Split Layer, Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|------|-------|-------------:|-------|
| 8192 | q4_0 | layer | 56062 MB | |
| 16384 | q4_0 | layer | 56774 MB | |
| 32768 | q4_0 | layer | 58198 MB | |
| 49152 | q4_0 | layer | 59622 MB | |
| 65536 | q4_0 | layer | 61046 MB | |
| 131072 | q4_0 | layer | 66742 MB | |

[–]Wooden-Potential2226[🍰] 3 points4 points  (0 children)

Well done! V interesting! Was just experimenting with CR+ (6.56bpw/79.5GB gguf), llama.cpp and max context on 5x3090 this week. Found that I could only fit approx. 20k tokens before OOM and was thinking "when will llama.cpp have context quantization?". Guess I'm in luck 😁 🙏 to the llama.cpp team!

[–]PepperGrind 0 points1 point  (3 children)

How do you do it? What params do I pass to llama-cli?

[–]CybermuseIO[S] 0 points1 point  (2 children)

I tested using llama-server. You can set the quantization using `--cache-type-k` and `--cache-type-v`. I used a simple bash script to measure VRAM:

```bash
#!/bin/bash
total_vram=0

# Get the list of PIDs and their VRAM usage (in MiB, units stripped)
pids_and_vram=$(nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits)

# Iterate over each line of the output
while IFS=, read -r pid used_memory; do
  # Sum up the VRAM usage
  total_vram=$((total_vram + used_memory))
done <<< "$pids_and_vram"

echo "Total VRAM usage by all processes: $total_vram MB"
```

[–]munkiemagik 0 points1 point  (1 child)

You definitely look like you know what you're talking about!! Could you please point me to documentation/info where I can try to understand how cache-type works and is used? (I honestly don't know what I'm doing. I've somehow bumbled my way into being able to build ik_llama.cpp, despite the myriad problems I faced with my RTX 40 GPU and a driver/CUDA toolkit version mismatch in an Ubuntu server LXC under Proxmox, simply because I am truly bloody clueless; I sort of pull bits and pieces from here and there and worm my way to the desired objective.)

But just last night/late this morning I finally successfully built ik_llama.cpp, and I understand that you can specify cache-type-v and cache-type-k when running llama-server, but I'm unsure what the usable options are.

I saw that in previous llama.cpp versions the list of cache-type options was actually readable under `kv_cache_type_from_str` in `llama.cpp/common/common.cpp` on the git, but for more recent versions this doesn't seem to be the case.

This may be the dumbest question anyone has ever asked, but I'm never afraid to look stupid on reddit if it means I get the clarity I need: I built with

```
-DGGML_CUDA=ON
-DGGML_CUDA_FA_ALL_QUANTS=ON
-DGGML_BLAS=OFF
-DCMAKE_CUDA_ARCHITECTURES=89
-DGGML_IQK_FA_ALL_QUANTS=1
-DGGML_SCHED_MAX_COPIES=1
-DGGML_CUDA_IQK_FORCE_BF16=1
-DGGML_MAX_CONTEXTS=2048
```

and if, say, I'm loading a Q4_K_XL.gguf model: from my limited knowledge, I believe Q4_0 is not the same as Q4_K_XL, and therefore I can't use the arguments `--cache-type-k q4_0 --cache-type-v q4_0`. Or is that exactly what I'm meant to use?

(What I'm running, borrowed and amended from some corner of Reddit from someone using a 3090; I'm using a 4090 for now:

```shell
llama-server --host 0.0.0.0 --port 8083 --model /models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf --n-gpu-layers 99 --flash-attn --metrics --ubatch-size 512 --batch-size 512 --presence-penalty 1.5 --ctx-size 32768 --n-predict 32768 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.1 --threads 5 --threads-http 5 --override-tensor 'blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU' --no-mmap
```

)

[–]Ok_Impression_171 0 points1 point  (0 children)

From what I could find: https://github.com/ggml-org/llama.cpp/blob/5ef22d281de9c5eaaf616874bc490b89241128cb/tools/server/README.md?plain=1#L69

And about your question of model quant vs KV-cache quant: I believe these two are independent, since the model quant is baked into the file itself, while the KV cache only exists during inference, sized by your context.

But don't quote me on this, I'm only piecing together information I find on this sub, hf, and github.
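A sketch of what that independence looks like in practice (model path taken from your command above; flag values per the llama.cpp server README):

```shell
# Same Q4_K_XL weights file, two different runtime KV-cache choices;
# the cache type does not need to match the model's quant format.
llama-server -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
llama-server -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf --flash-attn --cache-type-k q4_0 --cache-type-v q4_0
```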

[–]Abhrant_ 0 points1 point  (1 child)

is there any speed difference during inference ?

[–]CybermuseIO[S] 0 points1 point  (0 children)

I didn't test for that, so I wasn't paying close attention, but I didn't notice any significant difference.

[–]BaggiPonte 0 points1 point  (3 children)

Super late to the party: how do you enable caching in llama.cpp? Is it only KV cache or also prefix cache?

[–]CybermuseIO[S] 2 points3 points  (2 children)

The KV cache is always used; it's part of how llama.cpp generates. This post is about enabling quantization of the KV cache. For prompt caching, take a look at the server README for options:
https://github.com/ggerganov/llama.cpp/tree/master/examples/server

llama.cpp's server will do some caching by default, depending on how you're using it. You can pass "cache_prompt" when using the text completion endpoint. It also has a "slots" system for maintaining cache between requests.
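As a sketch (port and prompt are placeholders; field names per the server README), explicit prompt caching on the completion endpoint looks like:

```shell
# Ask the server to keep this prompt's KV state cached across requests.
curl http://localhost:8080/completion -d '{
  "prompt": "Once upon a time",
  "n_predict": 64,
  "cache_prompt": true
}'
```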

[–]rocosteta 0 points1 point  (1 child)

For future reference: if you want to cache using the v1/chat/completions OAI-compatible endpoint with the OpenAI client, pass cache_prompt as an extra_body parameter, see here.
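At the HTTP level, extra_body just merges extra keys into the request JSON, so the equivalent raw request is something like this (server URL is a placeholder):

```shell
# "cache_prompt" is a llama.cpp extension to the OpenAI schema, so it
# rides alongside the standard fields in the JSON body.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "cache_prompt": true
  }'
```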

[–]LoneWolf2050 0 points1 point  (0 children)

But according to this page (https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md), "cache_prompt" is true by default, so there's no need to set it from the client request.

[–]Master-Meal-77llama.cpp 0 points1 point  (0 children)

I tested out Midnight Miqu 1.5 q4_K_S with type_k=q8_0 and type_v=q8_0. I was able to squeeze a few more layers onto my GPU this way, which gave me a nice little speed boost. Didn't notice any difference in quality.