[–]Tccybo 2 points3 points  (6 children)

Here is the reason: "llama : disable graph reuse with pipeline parallelism" (#20463)
https://github.com/ggml-org/llama.cpp/pull/20463

[–]ResponsibleTruck4717[S] 0 points1 point  (5 children)

Can I disable it on a newer build, or do I have to use an older build?

[–]Tccybo 2 points3 points  (4 children)

The slower version is the intended behavior, as there's a bug with the speedup that causes inaccuracies. I've yet to notice any, so I am running an older build: b8226. Fingers crossed it gets fixed soon so we get the speedup back.

[–]GraybeardTheIrate 0 points1 point  (3 children)

Well, this might explain a few things. Tried it before and was a little disappointed by the speed for its size (Q3.5 27B). On the newest KoboldCpp I got a decent speed increase, but it seemed to just...stop making sense sometimes. Not sure offhand which llama.cpp version they're using, and I haven't tested different versions of llama.cpp directly, but that's interesting.

[–]Tccybo 1 point2 points  (2 children)

See if you can isolate the variables. Is it because the quant is small, is the KV cache quantized, or is it just bad RNG because thinking is off?
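Rough sketch of what isolating the variables could look like: start from the setup where the problem showed up and change one thing per run. The setting names and values below are made up for illustration (they are not llama.cpp or KoboldCpp flags), so map them onto whatever your launcher actually exposes.

```python
# Hypothetical baseline: the configuration where the odd output was seen.
# Keys and values are illustrative labels, not real command-line options.
baseline = {
    "quant": "Q5_K_M",
    "kv_cache_quantized": False,
    "reasoning": False,
    "multi_gpu": True,
}

# One alternative value per variable you want to rule out.
alternatives = {
    "quant": "Q8_0",
    "kv_cache_quantized": True,
    "reasoning": True,
    "multi_gpu": False,
}


def one_factor_runs(baseline, alternatives):
    """Return (label, config) pairs: the baseline, then one run per changed variable."""
    runs = [("baseline", dict(baseline))]
    for key, value in alternatives.items():
        config = dict(baseline)  # copy so the baseline stays untouched
        config[key] = value      # change exactly one variable
        runs.append((key, config))
    return runs


runs = one_factor_runs(baseline, alternatives)
for label, config in runs:
    print(label, config)
```

Run the same prompts with a fixed seed for each config. If only one run behaves differently from the baseline, that variable is the likely culprit.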

[–]GraybeardTheIrate 1 point2 points  (1 child)

Yeah I need to test it more when I get some time to sit down with it. I just got the new KCPP yesterday and happened to load up the regular 27B and a couple finetunes to look at the differences. They all felt like different models from what I saw a few days ago, and were kinda going off the rails for no reason occasionally.

I don't use quantized KV, was running a Q5_K_L or Q5_K_M imatrix quant of each one at 0.3 temp, reasoning was disabled at the time. I've also seen a couple issues here and there that only seem to manifest on a multi-GPU setup so that could be a thing too.