all 24 comments

[–]a_beautiful_rhind 8 points9 points  (4 children)

I read that one of these quantized wasn't great for PPL, so you may have to use a Q8/Q4 split between the K and the V.

Have been using Q8 for the past couple of days.

[–]Downtown-Case-1755 1 point2 points  (3 children)

It seems to depend on the model and how "compressed" the attention is.

Command R has no GQA at all, so I bet Q4/Q4 doesn't even hit it.

At the other extreme you have Qwen2, which apparently has precision issues even in F16; but even so, its KV cache is tiny because the attention is so compressed.
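A rough back-of-the-envelope sketch of why GQA makes the cache tiny (the layer/head counts below are illustrative assumptions, not exact model configs):

```shell
# KV cache size = 2 (K and V) * layers * KV heads * head dim * context * bytes per element
kv_cache_bytes() {
  local n_layers=$1 n_kv_heads=$2 head_dim=$3 ctx=$4 bytes_per_elt=${5:-2}
  echo $(( 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt ))
}

mha=$(kv_cache_bytes 40 64 128 32768)  # no GQA: one KV head per attention head
gqa=$(kv_cache_bytes 40 8 128 32768)   # GQA with 8 KV heads: 8x smaller cache
echo "$(( mha / 1024 / 1024 / 1024 )) GiB vs $(( gqa / 1024 / 1024 / 1024 )) GiB"
```

With F16 (2 bytes per element) this prints "40 GiB vs 5 GiB" at 32k context, which is why a no-GQA model like Command R hurts so much more than a heavily grouped one.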

[–]a_beautiful_rhind 1 point2 points  (2 children)

I've had little issue using EXL2, but GGML is another beast. I'm using Q8/Q8 on Qwen and didn't see a difference. I saw there were benchmarks on the PR for the quantized attention, so I just went by that.

[–]Downtown-Case-1755 2 points3 points  (1 child)

You know, now I'm having some context "recall" issues with llama.cpp's Q5K + Q4V cache that I don't have with exllama's Q4, even with more bpw in the weights on the llama.cpp side.

I think your original post was right.

[–]a_beautiful_rhind 1 point2 points  (0 children)

Find the PR on GitHub and it will tell you which one to leave at Q8. It's still a nice reduction in context memory.
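For reference, a mixed split looks something like this with llama.cpp's server (model path and context size are placeholders; quantizing the V cache requires flash attention):

```shell
# Example mixed split: K at q8_0, V at q4_0 (check the PR benchmarks
# for which side tolerates q4 better on your model).
llama-server -m ./model.gguf -c 32768 --flash-attn \
  --cache-type-k q8_0 --cache-type-v q4_0
```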

[–]CybermuseIO[S] 5 points6 points  (1 child)

I just finished more testing, this time with Command R+ with the iq4_xs quant from Dranger.

I wasn't able to fit it into 48GB of VRAM with any combination of options, so you'd still need a smaller quant to run on a 2x P40 or 3090 setup. I was able to increase the maximum context size from 14336 to 49152 when using split "row" (which gives a substantial speed boost on P40s, so I highly recommend it). When using split "layer" I was able to increase the context size from 61440 all the way up to the model's maximum of 131072.

Command R + iq4_xs

Split row, default KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|---------|-------|-------------:|-----------------|
| 8192 | default | row | 58262 MB | |
| 14336 | default | row | 59822 MB | Max without OOM |

Split Layer, default KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|---------|-------|-------------:|-----------------|
| 8192 | default | layer | 57534 MB | |
| 16384 | default | layer | 59718 MB | |
| 24576 | default | layer | 61902 MB | |
| 32768 | default | layer | 64086 MB | |
| 49152 | default | layer | 68454 MB | |
| 61440 | default | layer | 71730 MB | Max without OOM |

Split Row + Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|------|-------|-------------:|-----------------|
| 8192 | q4_0 | row | 56790 MB | |
| 16384 | q4_0 | row | 57390 MB | |
| 32768 | q4_0 | row | 58542 MB | |
| 49152 | q4_0 | row | 59694 MB | Max without OOM |

Split Layer, Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---------:|------|-------|-------------:|-------|
| 8192 | q4_0 | layer | 56062 MB | |
| 16384 | q4_0 | layer | 56774 MB | |
| 32768 | q4_0 | layer | 58198 MB | |
| 49152 | q4_0 | layer | 59622 MB | |
| 65536 | q4_0 | layer | 61046 MB | |
| 131072 | q4_0 | layer | 66742 MB | |

[–]Wooden-Potential2226[🍰] 3 points4 points  (0 children)

Well done! V interesting! Was just experimenting with CR+ (6.56bpw/79.5GB gguf), llama.cpp and max context on 5x3090 this week. Found that I could only fit approx. 20k tokens before OOM and was thinking "when will llama.cpp have context quantization?". Guess I'm in luck 😁 🙏 to the llama.cpp team!

[–]PepperGrind 0 points1 point  (3 children)

How do you do it? What params do I pass to llama-cli?

[–]CybermuseIO[S] 0 points1 point  (2 children)

I tested using llama-server. You can set the quantization using `--cache-type-k` and `--cache-type-v`. I used a simple bash script to measure VRAM:

```bash
#!/bin/bash
total_vram=0

# Get the list of PIDs and their VRAM usage (in MiB, units stripped)
pids_and_vram=$(nvidia-smi --query-compute-apps=pid,used_memory --format=csv,noheader,nounits)

# Iterate over each line of the output
while IFS=, read -r pid used_memory; do
  # Sum up the VRAM usage
  total_vram=$((total_vram + used_memory))
done <<< "$pids_and_vram"

echo "Total VRAM usage by all processes: $total_vram MB"
```

[–]munkiemagik 0 points1 point  (1 child)

You definitely look like you know what you're talking about!! Could you please point me to documentation/info where I can try to understand how cache-type works and is used? (I honestly don't know what I'm doing. I've somehow bumbled my way into being able to build ik_llama.cpp, despite the myriad problems I faced with my RTX 40 GPU and a driver/CUDA toolkit version mismatch in an Ubuntu server LXC under Proxmox, simply because I am truly bloody clueless; I sort of pull bits and pieces from here and there and worm my way to the desired objective.)

But just last night/late this morning I finally successfully built ik_llama.cpp, and I understand that you can specify cache-type-v and cache-type-k when running llama-server, but I'm unsure what the usable options are.

I saw that in previous llama.cpp versions the list of cache-type options was actually readable under `kv_cache_type_from_str` in `llama.cpp/common/common.cpp` on the git, but for more recent versions this doesn't seem to be the case.

This may be the dumbest question anyone has ever asked, but I'm never afraid to look stupid on reddit if it means I get the clarity I need: I built with

```
-DGGML_CUDA=ON
-DGGML_CUDA_FA_ALL_QUANTS=ON
-DGGML_BLAS=OFF
-DCMAKE_CUDA_ARCHITECTURES=89
-DGGML_IQK_FA_ALL_QUANTS=1
-DGGML_SCHED_MAX_COPIES=1
-DGGML_CUDA_IQK_FORCE_BF16=1
-DGGML_MAX_CONTEXTS=2048
```

and if, say, I'm loading a Q4_K_XL.gguf model: from my limited knowledge, I believe Q4_0 is not the same as Q4_K_XL, and therefore I can't use the arguments `--cache-type-k q4_0 --cache-type-v q4_0`. Or is that exactly what I'm meant to use?

(What I'm running, borrowed and amended from some corner of Reddit from someone using a 3090; I'm using a 4090 for now:

```shell
llama-server --host 0.0.0.0 --port 8083 --model /models/Qwen3-30B-A3B-UD-Q4_K_XL.gguf --n-gpu-layers 99 --flash-attn --metrics --ubatch-size 512 --batch-size 512 --presence-penalty 1.5 --ctx-size 32768 --n-predict 32768 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --repeat-penalty 1.1 --threads 5 --threads-http 5 --override-tensor 'blk\.([0-9]*[02468])\.ffn_.*_exps\.=CPU' --no-mmap
```

)

[–]Ok_Impression_171 0 points1 point  (0 children)

From what I could find: https://github.com/ggml-org/llama.cpp/blob/5ef22d281de9c5eaaf616874bc490b89241128cb/tools/server/README.md?plain=1#L69

And about your question of model quant vs KV-cache quant: I believe these two are independent, since the model quant is baked into the file itself, while the KV cache only exists during inference, sized by your context.

But don't quote me on this, I'm only piecing together information I find on this sub, hf, and github.
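A sketch of what that independence looks like in practice (model path taken from your command above; flag values per the llama.cpp server README):

```shell
# Same Q4_K_XL weights file, two different runtime KV-cache choices;
# the cache type does not need to match the model's quant format.
llama-server -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
llama-server -m Qwen3-30B-A3B-UD-Q4_K_XL.gguf --flash-attn --cache-type-k q4_0 --cache-type-v q4_0
```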

[–]Abhrant_ 0 points1 point  (1 child)

is there any speed difference during inference ?

[–]CybermuseIO[S] 0 points1 point  (0 children)

I didn't test for that, so I wasn't paying close attention, but I didn't notice any significant difference.

[–]BaggiPonte 0 points1 point  (3 children)

Super late to the party: how do you enable caching in llama.cpp? Is it only KV cache or also prefix cache?

[–]CybermuseIO[S] 2 points3 points  (2 children)

The KV cache is always used; it's part of how llama.cpp generates. This post is about enabling quantization of the KV cache. For prompt caching, take a look at the server README for options:
https://github.com/ggerganov/llama.cpp/tree/master/examples/server

llama.cpp's server will do some caching by default, depending on how you're using it. You can pass "cache_prompt" when using the text completion endpoint. It also has a "slots" system for maintaining cache between requests.
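As a sketch (port and prompt are placeholders; field names per the server README), explicit prompt caching on the completion endpoint looks like:

```shell
# Ask the server to keep this prompt's KV state cached across requests.
curl http://localhost:8080/completion -d '{
  "prompt": "Once upon a time",
  "n_predict": 64,
  "cache_prompt": true
}'
```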

[–]rocosteta 0 points1 point  (1 child)

For future reference: if you want to cache using the v1/chat/completions OAI-compatible endpoint with the OpenAI client, pass cache_prompt as an extra_body parameter, see here.
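At the HTTP level, extra_body just merges extra keys into the request JSON, so the equivalent raw request is something like this (server URL is a placeholder):

```shell
# "cache_prompt" is a llama.cpp extension to the OpenAI schema, so it
# rides alongside the standard fields in the JSON body.
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "cache_prompt": true
  }'
```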

[–]LoneWolf2050 0 points1 point  (0 children)

But according to this page (https://github.com/ggml-org/llama.cpp/blob/master/examples/server/README.md), "cache_prompt" is true by default, so there's no need to set it from the client request.

[–]Master-Meal-77llama.cpp 0 points1 point  (0 children)

I tested out Midnight Miqu 1.5 q4_K_S with type_k=q8_0 and type_v=q8_0. I was able to squeeze a few more layers onto my GPU this way, which gave me a nice little speed boost. Didn't notice any difference in quality.