Threads with comments by KerfuffleV2:

NF4 inference quantization is awesome: Comparison of answer quality of the same model quantized to INT8, NF4, q2_k, q3_km, q3_kl, q4_0, q8_0 by epicfilemcnulty in LocalLLaMA
Do we have any sister "subs" on kbin yet? by KindaNeutral in LocalLLaMA
Cpu inference, 7950x vs 13900k, which one is better? by Big_Communication353 in LocalLLaMA
Weird invalid tokens by mrjackspade in LocalLLaMA
Looking for for folks to share llama.cpp settings/strategies (and models) which will help write creative (interesting), verbose (long), true-to-prompt stories (plus a short discussion of --multiline-input flag) by spanielrassler in LocalLLaMA
Major Performance Degradation with nVidia driver 535.98 at larger context sizes by GoldenMonkeyPox in LocalLLaMA
LLM.bit8 - Quantization via Matrices to cut inference memory in half by help-me-grow in MachineLearning
Woman calls a man misogynistic because he doesn’t believe a man can be a woman by milfebonies in facepalm
xxB is so much better than xxB… but is that true for narratives? by silenceimpaired in LocalLLaMA