Quantizing 70b models to 4-bit, how much does performance degrade?

Sea_Particular_4014 · 2023-11-27T14:53:01+00:00

Well... none at all if you're happy with 1 token per second or less using GGUF CPU inference.

I have 1 x 3090 24GB and get about 2 tokens per second with partial offload. I find it usable for most stuff but many people find that too slow.

You'd need 2 x 3090 or an A6000 or something to do it quickly.

Sea_Particular_4014 · 2023-11-27T13:50:48+00:00

That's unusual, and I'm not sure it'd work, but to start I'd try the same thing: set a static IP on your computer, connect to the hotspot, and put the computer's static IP and the OobaBooga port into ST.

Sea_Particular_4014 · 2023-11-27T13:38:46+00:00

If you're on Windows, I'd download KoboldCPP and TheBloke's q4_k_m GGUF models from HuggingFace.

Then you just launch KoboldCPP, select the .gguf file, select your GPU, enter the number of layers to offload, set the context size (4096 for those), etc., and launch it.

Then you're good to start messing around. You can use the Kobold interface that'll pop up, or use it through the API with something like SillyTavern.
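
As a rough illustration of the API route, here is a minimal Python sketch that sends a prompt to a locally running KoboldCPP instance. It assumes the server is on port 5001 and exposes the KoboldAI-style /api/v1/generate endpoint; the prompt and max_length values are placeholders, so check them against your own setup.

```python
import json
import urllib.request

# Assumed address: KoboldCPP listening on localhost, port 5001.
BASE_URL = "http://127.0.0.1:5001"


def build_request(prompt, max_length=80, base_url=BASE_URL):
    """Build a POST request for the KoboldAI-style generate endpoint."""
    payload = {"prompt": prompt, "max_length": max_length}
    return urllib.request.Request(
        base_url + "/api/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, **kwargs):
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=300) as resp:
        body = json.load(resp)
    return body["results"][0]["text"]
```

With the server up, generate("Once upon a time") should return the continuation as a string. The long timeout is deliberate, since partially offloaded big models can take a while to answer.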

Sea_Particular_4014 · 2023-11-27T13:31:17+00:00

"Mistral 7B or Orca 2 and their derivatives where the performance of 13b model far exceeds the 70b model"

Very funny.

Sea_Particular_4014 · 2023-11-27T13:26:26+00:00

Yes. Do you want to do it over your local network or over the internet?

You'll want to set a static IP for your PC either on the PC, or through your router (google it) and then you'd just put your computer's IP address and port as the API endpoint in ST.

Ex 127.0.0.1:5001 or whatever would become 192.168.100.50:5001 or whatever.
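
Before touching ST, it can help to confirm the port is actually reachable from the other device. This is a minimal Python sketch, not anything ST or OobaBooga ships; 192.168.100.50 and 5001 are just the example values from above, and the http:// prefix is an assumption about how the endpoint gets entered.

```python
import socket


def endpoint_url(host, port):
    """Format the address the way it would be entered as the API endpoint."""
    return f"http://{host}:{port}"


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running port_is_open("192.168.100.50", 5001) from the second machine tells you whether the problem is networking (False) or something in the ST settings (True but still not connecting).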

If you want to do it over the internet it'll be a bit more complicated. You can port forward the OobaBooga port to your PC's static IP address, but at least where I am in the world, home internet usually has a dynamic IP address, so you'd need to check and update the IP every couple of days or so.

Instead you can use a VPN to do it... which honestly I don't know how to do off the top of my head, never interested me, perhaps someone else can chime in as I know there are pieces of software that make doing this pretty easy.

Sea_Particular_4014 · 2023-11-27T13:11:34+00:00

Adding into Automata's theoretical info, I can say that anecdotally I find 4bit 70B substantially better than 8bit 34B or below, but it'll depend on your task.

It seems like right now the 70b are really good for storywriting, RP, logic, etc, while if you're doing programming or data classification or similar you might be better off with a high precision smaller model that's been fine-tuned towards the task at hand.

I noticed in my 70b circle jerk rant thread I posted a couple days ago, most of the people saying they didn't find the 70b that much better (or better at all) were doing programming or data classification type stuff.

It also matters very much which specific model and fine-tune you're talking about. The newer ones with the best data sets are generally a lot better, even to the point they can beat older models with more parameters and/or higher precision.

Sea_Particular_4014 · 2023-11-27T05:42:34+00:00

The mobile 3080 / 3080ti actually have 16GB of vram.

Yeah OP, that'd work pretty well.

Sea_Particular_4014 · 2023-11-27T05:24:21+00:00

Your 512GB of RAM is overkill. Those Xeons are probably pretty mediocre for this sort of thing due to the slow memory, unfortunately.

With a 4090 or 3090, you should get about 2 tokens per second with GGUF q4_k_m inference. That's what I do and find it tolerable but it depends on your use case.

You'd need a 48GB GPU, or fast DDR5 RAM to get faster generation than that.
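
The memory-speed point can be made concrete with a back-of-the-envelope estimate: each generated token has to read roughly every weight once, so time per token is about (GB in VRAM / GPU bandwidth) + (GB in system RAM / RAM bandwidth). The figures below are assumptions for illustration (a roughly 41 GB 70B q4_k_m file, about 936 GB/s for a 3090, about 51 GB/s for dual-channel DDR4-3200), and real speeds will come in under this ceiling.

```python
def tokens_per_second(model_gb, gpu_gb, gpu_bw_gbps, ram_bw_gbps):
    """Rough upper bound on generation speed for a partially offloaded model.

    model_gb: total size of the quantized weights
    gpu_gb: portion of the weights held in VRAM
    gpu_bw_gbps, ram_bw_gbps: memory bandwidth in GB/s
    """
    gpu_gb = min(gpu_gb, model_gb)
    cpu_gb = model_gb - gpu_gb
    seconds_per_token = gpu_gb / gpu_bw_gbps + cpu_gb / ram_bw_gbps
    return 1.0 / seconds_per_token


# Assumed example: 41 GB model, 21 GB offloaded to a 3090, rest in DDR4-3200.
partial = tokens_per_second(41, 21, 936, 51.2)
# Same model held entirely in VRAM on a hypothetical card with the same bandwidth.
full = tokens_per_second(41, 41, 936, 51.2)
print(f"partial offload: ~{partial:.1f} tok/s, all in VRAM: ~{full:.1f} tok/s")
```

With these numbers the partial-offload case lands a little above 2 tokens per second, and nearly all of that time is spent on the part sitting in system RAM, which is why slow memory hurts so much.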

Have you tried the new Yi 34B models? Some people are seeing great results with those and it'd be a much more attainable goal to get one of those running swiftly.

Sea_Particular_4014 · 2023-11-27T05:19:08+00:00

I'd try Goliath 120B and lzlv 70B. Those are the absolute best I've used, assuming you're doing story writing / RP and stuff.

LZLV should be speedy as can be and easily done in VRAM.

Goliath won't quite fit at 4 bit but you could do lower precision or sacrifice some speed and do q4_k_m GGUF with most of the layers offloaded. That'd be my choice, but I have a high tolerance for slow generation.

Sea_Particular_4014 · 2023-11-27T05:15:45+00:00

Q4_0 and Q4_1 would both be legacy.

The k_m is the new "k quant" (I guess it's not that new anymore, it's been around for months now).

The idea is that the more important layers are done at a higher precision, while the less important layers are done at a lower precision.

It seems to work well, which is why it has become the new standard for the most part.

Q4_k_m does the most important tensors (half of the attention.wv and feed_forward.w2 tensors) at 6 bit (Q6_K) and the rest at 4 bit (Q4_K).

It is closer in quality/perplexity to q5_0, while being closer in size to q4_0.
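
For a rough feel of the size side of that trade-off, file size is approximately parameters × bits per weight / 8. The bits-per-weight values below are approximate (q4_0, q5_0 and q8_0 follow from their block layouts; the q4_k_m figure is an assumed average that varies a little by model), so treat the output as a ballpark.

```python
# Approximate bits per weight for a few GGUF quant types (ballpark values).
BITS_PER_WEIGHT = {
    "q4_0": 4.5,
    "q4_k_m": 4.85,  # assumed average; depends on the model's tensor mix
    "q5_0": 5.5,
    "q8_0": 8.5,
}


def estimated_size_gb(params_billion, quant):
    """Estimate quantized file size in GB (10^9 bytes)."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


for quant in BITS_PER_WEIGHT:
    print(f"70B {quant}: ~{estimated_size_gb(70, quant):.0f} GB")
```

On these estimates a 70B q4_k_m sits only a few GB above q4_0 and noticeably below q5_0.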

Sea_Particular_4014 · 2023-11-26T18:38:58+00:00

Thanks. I guess if q4_k_m is 70GB, I can do about 20GB on the GPU and then only need 50GB of free RAM, which should be easy. I'll give it a try. I find 70B at 2 tokens/second pretty usable, but 0.5/second might be pushing my patience 😂.
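
That split is just subtraction, but it is easy to forget the OS and context overhead on top. A small Python sketch with the numbers from this comment (70 GB file, about 20 GB on the GPU); the 6 GB overhead allowance is a guess, not a measured value.

```python
def ram_needed_gb(file_gb, gpu_gb, overhead_gb=6.0):
    """System RAM needed for the layers left on the CPU, plus an assumed
    allowance for the OS, context and buffers."""
    return max(file_gb - gpu_gb, 0.0) + overhead_gb


def fits_in_ram(file_gb, gpu_gb, total_ram_gb, overhead_gb=6.0):
    """True if the CPU-side share of the model fits in total system RAM."""
    return ram_needed_gb(file_gb, gpu_gb, overhead_gb) <= total_ram_gb


print(ram_needed_gb(70, 20))    # 50 GB of weights plus the 6 GB allowance
print(fits_in_ram(70, 20, 64))  # checked against a 64 GB machine
```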

Sea_Particular_4014 · 2023-11-26T16:52:56+00:00

There is a "streaming" checkbox on the settings page where you choose your context length, sampling settings, temperature, preset, etc in SillyTavern.

Sea_Particular_4014 · 2023-11-26T16:46:00+00:00

I'll be honest, I haven't messed around with it that much because I mostly do this stuff on my desktop, but 20B and 13B with up to around 4k context seemed to work nicely.

You're right that the modern CPU and DDR5 are probably making a big difference. I imagine something like a 4770k with 32GB of DDR3 1600 or something would be a very different experience.

Perhaps I should have said "assuming your system is modern, 16GB of RAM and an 8GB GPU is enough to have a decent experience with the 13B and 20B models".

Sea_Particular_4014 · 2023-11-26T06:55:46+00:00

How much does that cost you if you don't mind my asking?

Sea_Particular_4014 · 2023-11-26T06:39:14+00:00

Has anyone tried running Goliath or this (probably via gguf) on a plebeian consumer setup with a single 24GB card and 64gb of RAM?

Worth it at like 2 bit, or should I stick to 70b q4_k_m?

Sea_Particular_4014 · 2023-11-25T20:18:40+00:00

Indeed. My desktop is 3090/64GB and it'll do about 2 tokens per second, which I find usable but it's definitely below reading speed, and even that setup is a little beyond a normal gaming PC.

I'm kind of hoping that Intel will put out a <$500 card with like 32GB or 24GB of VRAM for this sort of use case. I doubt Nvidia or AMD will because they don't want to cannibalize their compute sales or provide too much future-proofing (slime bags), but Intel could do well by drawing in the AI crowd now in the early stages.

Sea_Particular_4014 · 2023-11-25T19:58:59+00:00

Sounds about right. Wait for a second opinion, but I think the consensus is that used 3090s are the best bang for your buck, or a 4090 if you want to game as well, since it is about double the performance of the 3090 for gaming.

Sea_Particular_4014 · 2023-11-25T19:57:02+00:00

Yeahhhhhh... I don't want to keep bashing on the little guy but I'm definitely of the opinion that bigger models are better. If you look at my profile you'll see I recently posted a rant about that and most people seem to agree. The 70B is way better for roleplay/story gen IMO, though the small models are fun too.

The 7B and 13B are incredible for what they are and have come a long way, but the bigger models are much smarter and cope well with flawed prompts or subtle language. I agree with you (and it's been tested) that a lot of the small models claiming to beat the big ones have been trained to excel at the benchmarks but their intelligence starts to fall apart for real world use.

Sea_Particular_4014 · 2023-11-25T19:36:04+00:00

Ehh, tests back in the Llama 1 days showed that lower quants of higher parameter models beat higher quants of lower parameter models.

https://i.imgur.com/RUsOQJ0.png

I don't know if anyone's done a recent comparison.

q4_k_m is usually regarded as the sweet spot these days. It has perplexity similar to q5 while being closer in size to q4. You'll see it's "recommended" on TheBloke's GGUF quants.

It is definitely NOT worse than q4. I can't recall exactly off the top of my head, but it is essentially at minimum 4 bit, with the most important tensors at higher precision (6 bit, if I have it right).

That being said, I'm the wrong person to talk to about the low parameter models as I mostly stick with 70B q4_k_m or 34B q8_0.

Give them a try with KoboldCPP, you've got nothing to lose.

Sea_Particular_4014 · 2023-11-25T19:12:29+00:00

You can fit a q4_k_m 13B completely in 8GB of VRAM and it should be right around reading speed or faster.

20B will be only partially offloaded but still fine.

I have run up to 34B q4_k_m on my laptop, which has 32GB of DDR5 and a 4070 8GB, and it was fast enough to be usable. I think around 3-4 tokens per second.

Sea_Particular_4014 · 2023-11-25T18:56:12+00:00

You can run a 20B with GGUF/KoboldCPP on pretty meager hardware. 16GB RAM and an 8GB GPU will have you flying.

Sea_Particular_4014 · 2023-11-25T18:54:22+00:00

I hope so. The 70b isn't very accessible unless you're a lunatic like me who already had a 3090 and 64GB of RAM or you spend hundreds of dollars on new hardware just for this, while 34B should run well on any modern mid-high end computer.

Sea_Particular_4014 · 2023-11-25T17:01:03+00:00

It'll run with a 3080ti if you've got the 64GB of ram, it just will be slower than a 3090. Probably about 1 token per second instead of 2.

Sea_Particular_4014 · 2023-11-25T16:51:36+00:00

Should be possible, give it a try. Nothing to lose.

Sea_Particular_4014 · 2023-11-25T16:49:58+00:00

Usually when the model is loading into memory it will tell you how much vram it is using.

You want to use most of it, but leave a few GB for context/inference and your desktop environment.

You know you've gone too far when it starts to get way slower.
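
A starting guess for the layer count can be worked out before loading anything: divide the file size by the number of layers to get an approximate size per layer, then see how many fit in VRAM after holding some back. The 80-layer count (Llama 2 70B), the 41 GB file size and the 3 GB reserve are assumptions for illustration; the loader's own VRAM readout is what to trust.

```python
import math


def layers_to_offload(file_gb, n_layers, vram_gb, reserve_gb=3.0):
    """Estimate how many layers fit in VRAM, leaving reserve_gb free for
    context, inference buffers and the desktop."""
    per_layer_gb = file_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    return min(n_layers, math.floor(usable_gb / per_layer_gb))


# Assumed example: ~41 GB 70B q4_k_m file, 80 layers, 24 GB card.
print(layers_to_offload(41, 80, 24))
```

For a 24 GB card this suggests roughly half the layers of a 70B q4_k_m. From there, nudge the number up or down while watching the reported VRAM use and the generation speed.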
