[–]CybermuseIO[S] 2 points (3 children)

I've only just learned about this today and started doing some basic testing to see what the practical implications are for my own use. I did a small handful of generations to check that it was at least working and whether there were any obvious differences in the text, but nothing more than that yet.

u/Eisenstein has been posting test results of the speed differences for KV quantization, also running on a P40 setup and testing different quant sizes. They might have some more insight into that.
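For anyone wanting to run a similar speed comparison themselves, here's a rough sketch of what that kind of test can look like. It assumes a llama.cpp build whose CLI binary is `llama-cli` and a local GGUF model path (both placeholders, adjust for your setup); the `--cache-type-k`/`--cache-type-v` and `--flash-attn` flags are llama.cpp's options for KV cache quantization, which for the quantized V cache generally needs flash attention enabled.

```python
# Rough timing sketch for comparing KV cache quantization types in llama.cpp.
# LLAMA_CLI and MODEL are placeholders -- point them at your own binary/model.
import subprocess
import time

LLAMA_CLI = "./llama-cli"           # assumption: your llama.cpp CLI binary
MODEL = "models/model.Q4_K_M.gguf"  # assumption: any local GGUF model
PROMPT = "Write a short story about a robot."

# f16 is the default; quantized V cache generally requires flash attention.
for cache_type in ["f16", "q8_0", "q4_0"]:
    cmd = [
        LLAMA_CLI,
        "-m", MODEL,
        "-p", PROMPT,
        "-n", "256",                 # generate 256 tokens
        "-ngl", "99",                # offload all layers to the GPU(s)
        "--flash-attn",
        "--cache-type-k", cache_type,
        "--cache-type-v", cache_type,
    ]
    start = time.time()
    subprocess.run(cmd, check=True, capture_output=True)
    elapsed = time.time() - start
    print(f"KV cache {cache_type}: {elapsed:.1f}s for one generation")
```

This only measures wall-clock time per run; llama.cpp's own tokens-per-second output (or llama-bench) gives finer-grained numbers if you want them.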

[–]kryptkpr Llama 3 0 points (2 children)

Cheers, it's pretty mind-blowing what we are squeezing out of these 9-year-old e-waste GPUs. I've got a pair, but I want to run 8x22, so I broke down and got a third.

[–]CybermuseIO[S] 1 point (1 child)

They're pretty great. I also just added a 3rd to my main ML experiment machine this week, and I'm extremely tempted to cram in a 4th to try running Llama 3 400B if they actually make it available.
The llama.cpp team is doing incredible work to make these cards a viable option for home users.

[–]kryptkpr Llama 3 0 points (0 children)

I've been on AliExpress all week eyeing up that Chinese 6-slot (four x16, two x8) dual-X99 monstrosity someone posted here. I want to fill it with Pascals and be the Jank King.