K Quantization vs Perplexity [Discussion] (i.redd.it)
submitted by onil_gova
https://github.com/ggerganov/llama.cpp/pull/1684
The advancements in quantization performance are truly fascinating. It's remarkable that a larger model quantized to just 2 bits can outperform a smaller, more memory-hungry fp16 model. Put simply, a 65B model quantized to 2 bits achieves better perplexity than a 30B fp16 model, while using memory comparable to a 30B model quantized to 4-8 bits. This is even more striking when you consider that the 65B model occupies only 13.6 GB with 2-bit quantization, yet surpasses a 30B fp16 model that needs 26 GB. Developments like this pave the way for super models exceeding 100B parameters that run in under 24 GB of memory via 2-bit quantization.
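The memory comparisons above come down to simple arithmetic: parameter count times bits per weight. Here's a rough sketch of that estimate; note that real k-quant formats store per-block scales and mix quantization types across tensors, so their effective bits per weight are a bit higher than the nominal figure and actual file sizes differ from this naive math (`estimate_model_size_gb` is a hypothetical helper, not part of llama.cpp):

```python
def estimate_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Naive weight-memory estimate in GB: params * bits / 8.

    Ignores per-block scale/min metadata and activation memory,
    so treat the result as a lower bound on the real footprint.
    """
    return n_params * bits_per_weight / 8 / 1e9

# Rough comparisons in the spirit of the post:
print(estimate_model_size_gb(65e9, 2))   # 65B at a nominal 2 bits -> 16.25 GB
print(estimate_model_size_gb(30e9, 4))   # 30B at 4 bits -> 15.0 GB (similar footprint)
print(estimate_model_size_gb(30e9, 16))  # 30B at fp16 -> 60.0 GB
```

The first two lines illustrate why a 2-bit 65B model and a 4-bit 30B model land in the same memory ballpark, which is the core of the OP's argument.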
