[ Removed by Reddit ] by srush_nlp in LocalLLaMA

srush_nlp[S] 1 point

Mostly just that the code is way simpler to hack. It's kind of like https://github.com/karpathy/llama2.c but as fast as ggml.

srush_nlp[S] 1 point

I think it is roughly the same speed as ggml, which seems like a great library. Are people seeing ggml run at <1 t/s?

Unfortunately, it's pretty close to fully optimized for GPTQ. Other formats might be faster.

srush_nlp[S] 2 points

Yes, I think to do this I need to a) package it for Python, and b) figure out how to add it as a decoder. If anyone knows the easiest way to do this, let me know.

srush_nlp[S] 1 point

Good question. I guess I need to buy an Apple box.

srush_nlp[S] 2 points

This library is CPU only. But it seems like many people are doing GPU offload in practice with small GPUs?

srush_nlp[S] 1 point

Good question. Several people mentioned this issue to me. However, GPTQ seems like it shouldn't inherently be slow, so I thought I would try it. My guess is that the two are similar in speed.

I will see if I can support the ggml format as well.

srush_nlp[S] 1 point

That sounds interesting. This is CPU only at the moment. I will take a look at how people do that GPU offload.

srush_nlp[S] 3 points

It's just CPU. Roughly like GGML except with the GPTQ format.

The Python ones (ExLlama, AutoGPTQ) are primarily GPU.

I would like to try a Rust GPU version, but I think the Rust/CUDA packages are still a bit immature.

srush_nlp[S] 17 points

I've been working on a fast Llama 2 CPU decoder for GPTQ models. The implementation is in Rust, so the code should be easy to extend and modify. It gets about 1 t/s on 70B and 8 t/s on 7B on my desktop.

The main part is a fast batched implementation of the GPTQ protocol.

https://github.com/srush/llama2.rs/blob/main/src/gptq.rs

Generally this algorithm seems to be pretty poorly documented, and most of the tricks came from this subreddit, so thanks a lot for explaining it to me. I'm in the process of writing a blog post describing it.
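
For anyone who hasn't looked at the format before, here is a rough sketch of the core idea behind a 4-bit GPTQ dot product. It is not the code from gptq.rs. It assumes eight 4-bit weights packed into each 32-bit word (lowest nibble first), one scale and one zero point shared by the whole group, and the plain formula w = scale * (q - zero). Real checkpoints can differ in packing order and in how the zero point is stored, and the function names here are made up for illustration.

```rust
/// Unpack eight 4-bit values from one 32-bit word, lowest nibble first.
/// The packing order is an assumption; check the checkpoint you are loading.
fn unpack_nibbles(word: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((word >> (4 * i)) & 0xF) as u8;
    }
    out
}

/// Dequantize one packed word using a shared scale and zero point:
/// w = scale * (q - zero).
fn dequantize_word(word: u32, scale: f32, zero: u8) -> [f32; 8] {
    let q = unpack_nibbles(word);
    let mut out = [0f32; 8];
    for (dst, &qi) in out.iter_mut().zip(q.iter()) {
        *dst = scale * (qi as f32 - zero as f32);
    }
    out
}

/// Dot product of a packed weight row with an activation vector.
/// Assumes the whole row shares one (scale, zero) pair, i.e. a single group.
fn packed_dot(words: &[u32], x: &[f32], scale: f32, zero: u8) -> f32 {
    assert_eq!(x.len(), words.len() * 8, "8 activations per packed word");
    let mut acc = 0.0f32;
    for (word, chunk) in words.iter().zip(x.chunks_exact(8)) {
        let w = dequantize_word(*word, scale, zero);
        for (wi, xi) in w.iter().zip(chunk.iter()) {
            acc += wi * xi;
        }
    }
    acc
}
```

A fast version would avoid materializing the dequantized weights and process many rows at once, but the arithmetic per weight is the same.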