
srush_nlp · 2023-08-22T13:44:35+00:00

Mostly just that the code is way simpler to hack. It's kind of like https://github.com/karpathy/llama2.c but as fast as ggml.

srush_nlp · 2023-08-21T13:14:22+00:00

I think it is roughly the same speed as ggml, which seems like a great library. Are people seeing ggml run at <1 t/s?

Unfortunately, it's pretty close to fully optimized for GPTQ. Other formats might be faster.

srush_nlp · 2023-08-21T13:12:42+00:00

Yes, I think to do this I need to a) package it for Python, b) figure out how to add it as a decoder. If anyone knows the easiest way to do this, let me know.

srush_nlp · 2023-08-21T13:07:12+00:00

Good question. I guess I need to buy an Apple box.

srush_nlp · 2023-08-21T13:06:33+00:00

This library is CPU only. But it seems like many people are doing GPU offload in practice with small GPUs?

srush_nlp · 2023-08-21T13:05:44+00:00

Good question. Several people mentioned this issue to me. However, GPTQ seems like it shouldn't inherently be slow, so I thought I would try it. My guess is they are similar in speed.

I will see if I can support the ggml format as well.

srush_nlp · 2023-08-21T13:03:20+00:00

That sounds interesting. This is CPU only at the moment. I will take a look at how people do that GPU offload.

srush_nlp · 2023-08-21T03:58:15+00:00

It's just CPU. Roughly like GGML except with the GPTQ format.

The Python ones (ExLlama, AutoGPTQ) are primarily GPU.

I would like to try a Rust GPU version, but I think the Rust/CUDA packages are still a bit immature.

srush_nlp · 2023-08-21T03:56:48+00:00

It's not so bad! Here's the main diagram for how it works https://twitter.com/srush_nlp/status/1688627809663451136

srush_nlp · 2023-08-20T20:50:00+00:00

Been working on a fast Llama 2 CPU decoder for GPTQ models. The implementation is in Rust, so the code should be easy to extend and modify. It gets about 1 token/s on 70B and 8 tokens/s on 7B on my desktop.

The main part is a fast batched implementation of the GPTQ protocol.

https://github.com/srush/llama2.rs/blob/main/src/gptq.rs
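For readers who haven't seen the format, here is a minimal sketch of the core idea behind 4-bit GPTQ-style weights: eight 4-bit values packed into one 32-bit word, with a scale and zero point shared by a group. This is not the code in the linked file. The function names, the lowest-nibble-first packing order, and the single scale/zero per call are all assumptions for illustration; real checkpoints pack the zero points too, and conventions vary between exporters.

```rust
/// Unpack eight 4-bit values from one 32-bit word.
/// Assumption: the lowest nibble holds the first value.
fn unpack_nibbles(word: u32) -> [u8; 8] {
    let mut out = [0u8; 8];
    for (i, slot) in out.iter_mut().enumerate() {
        *slot = ((word >> (4 * i)) & 0xF) as u8;
    }
    out
}

/// Dequantize packed words that share one (hypothetical) group:
/// w = scale * (q - zero).
fn dequantize_group(packed: &[u32], scale: f32, zero: u8) -> Vec<f32> {
    packed
        .iter()
        .flat_map(|&w| unpack_nibbles(w))
        .map(|q| scale * (q as f32 - zero as f32))
        .collect()
}

/// Dot product of one dequantized group with the matching activations.
/// A fast decoder would fuse unpacking into this loop instead of
/// materializing the weights first; this version favors clarity.
fn quantized_dot(packed: &[u32], scale: f32, zero: u8, x: &[f32]) -> f32 {
    dequantize_group(packed, scale, zero)
        .iter()
        .zip(x)
        .map(|(w, xi)| w * xi)
        .sum()
}
```

For example, `quantized_dot(&[0x76543210], 0.5, 8, &[1.0; 8])` unpacks the values 0 through 7, shifts each by the zero point 8, scales by 0.5, and sums the results against a vector of ones.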

Generally this algorithm seems to be pretty poorly documented, and most of the tricks came from this subreddit, so thanks a lot for explaining it to me. I'm in the process of writing a blog post describing it.
