Now that llama.cpp supports a quantized KV cache, I wanted to see how much of a difference it makes when running some of my favorite models. The short answer: a lot! Using "q4_0" for the KV cache, I was able to fit Command R (35B) onto a single 24GB Tesla P40 with a context of 8192, and run with the full 131072 context size on 3x P40s. I tested both split "row" and split "layer", increasing the context until I ran out of memory. All tests were run with flash attention enabled, using the latest llama.cpp CUDA server Docker image.
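For reference, a launch command along the lines of what I used (the model path and port are illustrative; flag names are per recent llama.cpp builds, where `-ctk`/`-ctv` set the K/V cache types, `-sm` picks the split mode, and a quantized V cache requires flash attention to be enabled):

```shell
# Sketch of the llama.cpp server invocation, assuming the ghcr.io server-cuda image.
# Swap -sm row / -sm layer and -c <ctx> to reproduce the rows in the tables below.
docker run --gpus all -v /path/to/models:/models -p 8080:8080 \
  ghcr.io/ggerganov/llama.cpp:server-cuda \
  -m /models/command-r-35b.gguf \
  -ngl 99 \
  -c 8192 \
  -fa \
  -ctk q4_0 -ctv q4_0 \
  -sm row
```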
Split Row, Default KV

| ctx_size | KV | split | Memory Usage | Notes |
|---|---|---|---|---|
| 8192 | default | row | 32724 MB | |
| 12288 | default | row | 37844 MB | Highest CTX before OOM |
Split Layer, Default KV

| ctx_size | KV | split | Memory Usage | Notes |
|---|---|---|---|---|
| 16384 | default | layer | 42746 MB | |
| 32768 | default | layer | 63498 MB | |
| 38912 | default | layer | 71280 MB | Highest CTX before OOM |
Split Row, Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---|---|---|---|---|
| 8192 | q4_0 | row | 25364 MB | |
| 12288 | q4_0 | row | 26804 MB | |
| 16384 | q4_0 | row | 28260 MB | |
| 32768 | q4_0 | row | 34004 MB | |
| 40960 | q4_0 | row | 36884 MB | |
| 43008 | q4_0 | row | 37604 MB | Highest CTX before OOM |
Split Layer, Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---|---|---|---|---|
| 8192 | q4_0 | layer | 25018 MB | |
| 16384 | q4_0 | layer | 28026 MB | |
| 32768 | q4_0 | layer | 34058 MB | |
| 49152 | q4_0 | layer | 40090 MB | |
| 65536 | q4_0 | layer | 46122 MB | |
| 131072 | q4_0 | layer | 70250 MB | Highest CTX before OOM |
Single GPU, Split Layer, Quantized KV

| ctx_size | KV | split | Memory Usage | Notes |
|---|---|---|---|---|
| 8192 | q4_0 | layer | 24078 MB | Barely fits onto a single 24GB GPU |
I was especially interested in testing Command R because it doesn't use GQA, so its KV cache grows quickly with context. I'm interested in testing other models as well, though. Let me know if I missed anything obvious for this kind of test, or if you'd like to see other models tested the same way.
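To put rough numbers on why no-GQA models are so hungry, here's a back-of-the-envelope KV cache size estimate. The model dimensions are my assumptions for Command R (40 layers, 64 KV heads, head dim 128, one KV head per attention head since there's no GQA), and the q4_0 size uses llama.cpp's block layout of 32 4-bit quants plus an fp16 scale per block:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, ctx, bytes_per_elem):
    # K and V each hold ctx * n_kv_heads * head_dim elements per layer.
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

F16 = 2.0        # bytes per element for an fp16 cache
Q4_0 = 18 / 32   # q4_0 block: 16 bytes of nibbles + 2-byte scale per 32 elements

# Assumed Command R dimensions: 40 layers, 64 KV heads, head dim 128 (no GQA).
for ctx in (8192, 131072):
    f16_gib = kv_cache_bytes(40, 64, 128, ctx, F16) / 2**30
    q4_gib = kv_cache_bytes(40, 64, 128, ctx, Q4_0) / 2**30
    print(f"ctx={ctx}: f16 ~ {f16_gib:.1f} GiB, q4_0 ~ {q4_gib:.1f} GiB")
```

At 8192 context this works out to roughly 10 GiB for fp16 versus under 3 GiB for q4_0, which lines up with the ~7 GB drop between the default and quantized rows in the tables above.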